Tips for NASA Pleiades#

This section provides general tips for setting up a NASA NAS account and running GCHP on Pleiades.

Account setup#

NASA provides a detailed walk-through at NASA Account Setup. A few points worth noting:

  • Differences among the LDAP password, Launchpad password, PIN, and passcode:

    • The LDAP password is for logging in to the sfe and pfe/lfe nodes

    • The Launchpad password is for logging in to id.nasa.gov

    • The PIN is the password set for RSA SecurID

    • The passcode is the one-time password generated by RSA SecurID

  • Setting up a public key and SSH passthrough makes subsequent logins easier:

    • Instructions: NASA SSH Passthrough

    • Setting up SSH passthrough requires a Linux-style terminal. Windows users may need a terminal such as Cygwin

    • Tip: keep the Cygwin installer so that you can install additional packages later, such as vim (Cygwin does not install vim by default)

    • Compute1 may lose the key added for SSH passthrough to NASA after you log out and back in. You can re-add it at each login by putting the following in .bash_profile:

      # Start an SSH agent and load the key used for NASA SSH passthrough
      eval "$(ssh-agent -s)"
      ssh-add ~/.ssh/id_rsa
      
  • Differences between sfe, pfe, and lfe:

    • sfe is used only for logging in to the NASA NAS system

    • pfe is usually where we land to compile and submit GCHP jobs

    • lfe is usually where we store large volumes of data, such as restart files and outputs from GCHP simulations

Note

The /nobackup filesystem is also mounted on lfe, so GCHP jobs can be submitted from lfe as well.

Shiftc data transfer tool#

  • Instructions for local transfer (within NASA NAS system): shiftc local transfer

  • Instructions for installing shiftc on other clusters (e.g. Compute1): shiftc remote transfer

    • Add sup to your $PATH. For example, if sup is located at $HOME/bin/sup, add export PATH=$PATH:$HOME/bin to .bash_profile and lsf-conf.rc

    • The sup shiftc command expires every 604800 s (7 days). To check whether it still works, run a command such as sup shiftc --status --state=run on a Compute1 home node

    • Transfers to or from systems outside NASA must be initiated from the remote cluster, i.e., by running sup shiftc on the remote cluster to copy files from or to the NASA system

  • Transferring between Compute1 and NASA with batch jobs

    1. Installing shiftc on the home node of Compute1 is also required for batch jobs

    2. A container is also available: docker(registry.gsc.wustl.edu/sleong/bbftp)

    3. Add tail -f /dev/null to the end of the batch script for data transfers on Compute1 to avoid losing the connection to the clusters.

      Then manually kill the Compute1 job once the transfer has finished.

      An example:

      #!/bin/bash
      #BSUB -n 1
      #BSUB -R "rusage[mem=50G] span[ptile=1] select[mem < 500GB] order[-slots]"
      #BSUB -q rvmartin
      #BSUB -a 'docker(1dandan/netcdf-utils:latest)'
      #BSUB -N
      #BSUB -u <your_wustl_key>@wustl.edu
      #BSUB -o transfer-%J.txt
      #BSUB -J "transfer"
      
      cd /my-projects
      sup shiftc pfe:/nobackup/dzhang8/GEOSChem.ACAG.20180101*.nc4 .
      # For a directory containing many large files, try: sup shiftc --hosts=8 --sync -r
      # --sync skips files that already exist at the destination
      # -r transfers directories recursively
      # --hosts=8 spreads the transfer over up to 8 client hosts in parallel
      # Keep the job alive; kill it manually once the transfer has finished
      tail -f /dev/null
      

Note

Transferring data (restarts and outputs from GCHP) from pfe to lfe reduces the amount of storage needed on pfe.
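
The PATH setup and the expiry interval described above can be sketched in a few lines of shell. This is only an illustration: it assumes sup is installed at $HOME/bin/sup as in the example above, and path_append is a hypothetical helper name, not part of sup or shiftc.

```shell
#!/bin/bash
# Append a directory to PATH only if it is not already present, so that
# sourcing .bash_profile more than once does not keep growing PATH.
# (path_append is a hypothetical helper name, not part of sup or shiftc.)
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                  # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

path_append "$HOME/bin"               # assumes sup lives at $HOME/bin/sup
export PATH

# Convert the expiry interval quoted above from seconds to days.
sup_expiry_seconds=604800
sup_expiry_days=$((sup_expiry_seconds / 86400))
echo "sup shiftc needs to be renewed every ${sup_expiry_days} days"
```

Placing the helper and the path_append call in .bash_profile has the same effect as the plain export line, but is safe to source repeatedly.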

Running GCHP on Pleiades#

  • GCHP environment: run source /u/yzhang52/gchp-intel.202304.env before compiling or running GCHP (compilation should be done on a compute node)

  • An example run script can be found at /u/yzhang52/gchp.run.pbs

Note

In the run script, set #PBS -W group_list=<your-project-id>. Project IDs and usage are shown by acct_ytd.

  • The NASA Pleiades system uses PBS for job scheduling. Commonly used PBS commands are listed at PBS Commands

  • Real-time usage of the different clusters on NASA Pleiades can be monitored at NASA System Status (note that the website takes several minutes to update)
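
The pieces above (environment script, project ID, PBS directives) might be combined into a job-script skeleton like the one printed below. This is a hedged sketch, not a copy of /u/yzhang52/gchp.run.pbs: make_gchp_pbs is a hypothetical helper, and the job name, the devel queue, the two-hour walltime, and the 28 cores per Broadwell node are assumptions to check against NASA Node Types and your own allocation.

```shell
#!/bin/bash
# Print a PBS job-script skeleton for a GCHP run.
# make_gchp_pbs is a hypothetical helper with arguments:
#   $1 project id, $2 number of nodes, $3 node model, $4 cores per node
make_gchp_pbs() {
    local project=$1 nodes=$2 model=$3 cores=$4
    cat <<EOF
#!/bin/bash
#PBS -N gchp_sketch
#PBS -q devel
#PBS -W group_list=${project}
#PBS -l select=${nodes}:ncpus=${cores}:mpiprocs=${cores}:model=${model}
#PBS -l walltime=02:00:00
#PBS -j oe

cd "\$PBS_O_WORKDIR"
source /u/yzhang52/gchp-intel.202304.env
# Launch GCHP here, following the example in /u/yzhang52/gchp.run.pbs
EOF
}

# Example: 2 Broadwell nodes with 28 cores each (assumed values)
make_gchp_pbs "<your-project-id>" 2 bro 28
```

Redirecting the output to a file and passing that file to qsub submits the job; the launch line itself is deliberately left to the example run script.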

Note

Another way to check the real-time availability of the different node types is node_stats.sh (already in your PATH). Example output:

Nodes currently allocated to the devel queue:
bro     :   Intel Broadwell Total:  110, Used:   65, Free:   45
cas_ait : Intel Cascadelake Total:   64, Used:   11, Free:   53
has     :     Intel Haswell Total:  145, Used:   96, Free:   49
ivy     :   Intel Ivybridge Total:  406, Used:  303, Free:  103
rom_ait :          AMD Rome Total:   69, Used:   69, Free:    0
sky_ele :     Intel Skylake Total:   20, Used:   10, Free:   10
SBU rate per node type: bro:1.0 bro_ele:1.0 cas_ait:1.64 cas_gpu:27.04 has:0.8 ivy:0.66 mil_a100:37.86 mil_ait:4.38 rom_ait:4.06 rom_gpu:75.72 sky_ele:1.59 sky_gpu:27.04
FY2024 SBU cost == $0.22/SBU

The Intel nodes bro, cas_ait, has, and sky_ele are the top choices for GCHP simulations. Detailed descriptions (such as core counts per node) can be found at NASA Node Types, in the PBS on <Cluster> section.
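
The node_stats.sh output can also be filtered and turned into a rough cost estimate. The sketch below parses the sample lines shown above and assumes that SBUs are charged as nodes × wall-clock hours × SBU rate; free_node_types and sbu_estimate are hypothetical helper names.

```shell
#!/bin/bash
# Sample lines copied from the node_stats.sh output shown above.
sample_stats='bro     :   Intel Broadwell Total:  110, Used:   65, Free:   45
cas_ait : Intel Cascadelake Total:   64, Used:   11, Free:   53
has     :     Intel Haswell Total:  145, Used:   96, Free:   49
ivy     :   Intel Ivybridge Total:  406, Used:  303, Free:  103
rom_ait :          AMD Rome Total:   69, Used:   69, Free:    0
sky_ele :     Intel Skylake Total:   20, Used:   10, Free:   10'

# Read node_stats.sh-style lines on stdin and print the node types that
# have at least $1 free nodes, together with the free count.
free_node_types() {
    awk -v min="$1" '/Free:/ && $NF + 0 >= min { print $1, $NF }'
}

# Estimate the charge for a job, assuming SBUs = nodes * hours * rate.
#   $1 nodes, $2 wall-clock hours, $3 SBU rate, $4 dollars per SBU
sbu_estimate() {
    awk -v n="$1" -v h="$2" -v r="$3" -v c="$4" \
        'BEGIN { sbu = n * h * r; printf "%.2f SBU (about $%.2f)\n", sbu, sbu * c }'
}

printf '%s\n' "$sample_stats" | free_node_types 40
sbu_estimate 4 12 1.0 0.22    # 4 bro nodes for 12 hours at the FY2024 rate
```

On Pleiades the same filter could be applied to live data with node_stats.sh | free_node_types 40, assuming the output keeps the format shown above.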

  • Model inputs /ExtData

Note

There is no shared /ExtData like the one on Compute1. Some inputs were previously downloaded to individual user directories, but these are no longer available:

Sebastian downloaded multiple required inputs to /nobackup/seastham/ExtData/ (no longer available)

Dandan downloaded the required inputs for simulations in 2018 and 2019 to /nobackup/dzhang8/ExtData/ (no longer available)

  • Before running GCHP, you have to download the inputs you need from AWS or the WashU data portal, or transfer them using shiftc, to /nobackup/<your_username>/ExtData/
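
Before downloading, it can help to check which required inputs are already present so that only the missing ones are fetched. The sketch below assumes you keep a plain-text list of required files, one path per line, relative to the ExtData root; list_missing_inputs and the file names in the demo are hypothetical.

```shell
#!/bin/bash
# Print the entries of a required-inputs list that are missing under an
# ExtData root. list_missing_inputs is a hypothetical helper:
#   $1 = ExtData root, $2 = text file with one relative path per line
list_missing_inputs() {
    local root=$1 list=$2 rel
    while IFS= read -r rel; do
        [ -z "$rel" ] && continue               # skip blank lines
        [ -e "$root/$rel" ] || printf '%s\n' "$rel"
    done < "$list"
}

# Self-contained demo in a temporary directory. On Pleiades the root would
# be /nobackup/<your_username>/ExtData.
demo=$(mktemp -d)
mkdir -p "$demo/ExtData/HEMCO"
touch "$demo/ExtData/HEMCO/present.nc"          # hypothetical file names
printf '%s\n' "HEMCO/present.nc" "HEMCO/absent.nc" > "$demo/required.txt"
list_missing_inputs "$demo/ExtData" "$demo/required.txt"
rm -rf "$demo"
```

The printed paths are the ones that still need to be downloaded or transferred.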

Processing outputs on Pleiades#

  • Dedicated data analysis nodes: the Lou Data Analysis Nodes (LDAN) can be used for postprocessing data (e.g. GCHP diagnostics)

  • Python environment: run source /u/yzhang52/python-gchp.env

  • Bring data to disk before processing it on lfe; otherwise I/O may stall for an unpredictable amount of time (see bring data to disk)
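
As a rough sketch of the last step, the wrapper below recalls files to disk before they are read. It assumes the DMF command dmget is available on lfe, as described in the linked NASA instructions; bring_to_disk is a hypothetical wrapper name and the path in the usage comment is a placeholder.

```shell
#!/bin/bash
# Recall files to disk before reading them, if the DMF tools are available.
# bring_to_disk is a hypothetical wrapper; dmget is assumed to be on PATH
# on lfe and is skipped with a message on systems that do not provide it.
bring_to_disk() {
    if command -v dmget >/dev/null 2>&1; then
        dmget "$@"
    else
        echo "dmget not found; skipping recall for $# file(s)" >&2
    fi
}

# Example on lfe (placeholder path):
#   bring_to_disk /u/<your_username>/gchp_outputs/*.nc4
```

Calling the wrapper at the top of a postprocessing script keeps the recall step in one place.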

Note

To save space on Pleiades, first check whether the inputs you need are already available, and download only the ones you actually need.