Tips for NASA Pleiades
This section provides general tips for setting up a NASA NAS account and running GCHP on Pleiades.
Account setup
NASA provides a detailed walk-through at NASA Account Setup.
The differences among the LDAP password, Launchpad password, PIN, and passcode:
The LDAP password is used to log in to the sfe and pfe/lfe nodes.
The Launchpad password is used to log in to id.nasa.gov.
The PIN is the password you set for RSA SecurID.
The passcode is the short-lived password generated by RSA SecurID.
Setting up public-key authentication and SSH passthrough makes subsequent logins easier:
Instructions: NASA SSH Passthrough
Setting up SSH passthrough requires a Linux-style terminal. Windows users may need to use a terminal such as Cygwin.
Tip: keep the Cygwin installer so you can install additional packages later, such as vim (Cygwin does not install vim by default).
Compute1 may lose the SSH passthrough to NASA after you log in again. We can restore it automatically by adding the following to .bash_profile:
# add for nasa
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa
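The lines above start a fresh ssh-agent and re-add the key on every login. A minimal variant for .bash_profile, assuming OpenSSH's documented ssh-add exit codes (0: the agent holds keys, 1: the agent holds no keys, 2: no agent could be contacted) and a key at ~/.ssh/id_rsa, skips both steps when an agent that already holds a key is reachable (e.g. a forwarded agent or a nested login shell). It assumes that any key already loaded in the agent is your NASA key.

```shell
# ~/.bash_profile on Compute1 (sketch): reuse an agent when one is reachable.
# ssh-add -l exit status: 0 = agent holds keys, 1 = agent holds no keys,
# 2 = no agent could be contacted.
ssh-add -l >/dev/null 2>&1
agent_status=$?
if [ "$agent_status" -eq 2 ]; then
    # No agent: start one; eval exports SSH_AUTH_SOCK and SSH_AGENT_PID.
    eval "$(ssh-agent -s)" >/dev/null
    agent_status=1
fi
if [ "$agent_status" -eq 1 ]; then
    # Agent is running but holds no keys: load the key used for NASA.
    ssh-add ~/.ssh/id_rsa
fi
unset agent_status
```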
Differences between sfe, pfe, and lfe:
sfe is only used for logging in to the NASA NAS system.
pfe is usually where we land for preparing and submitting GCHP jobs.
lfe is usually where we store large data, such as restart files and outputs from GCHP simulations.
Note
The /nobackup filesystem is also mounted on lfe, so GCHP jobs can be submitted from lfe as well.
Shiftc data transfer tool
Instructions for local transfer (within the NASA NAS system): shiftc local transfer
Instructions for installing shiftc on other clusters (e.g. Compute1): shiftc remote transfer
Add sup to your $PATH. For example, if sup is located at $HOME/bin/sup, add export PATH=$PATH:$HOME/bin to .bash_profile and lsf-conf.rc.
Access through sup shiftc expires every 604800 s (7 days). You can check it with, for example, sup shiftc --status --state=run on the Compute1 home node.
Transfers to or from outside the NASA system have to be initiated from the remote cluster, i.e., run sup shiftc on the remote cluster to transfer files from/to the NASA system.
Transferring between Compute1 and NASA with batch jobs
Installing shiftc on the Compute1 home node is also required for batch jobs.
There is also a container available (docker(registry.gsc.wustl.edu/sleong/bbftp)).
Add tail -f /dev/null at the end of batch transfer jobs on Compute1 to avoid losing the connection to the clusters, then manually kill the Compute1 job (e.g. with bkill <job_id>) once the transfer has finished.
An example:
#!/bin/bash
#BSUB -n 1
#BSUB -R "rusage[mem=50G] span[ptile=1] select[mem < 500GB] order[-slots]"
#BSUB -q rvmartin
#BSUB -a 'docker(1dandan/netcdf-utils:latest)'
#BSUB -N
#BSUB -u <your_wustl_key>@wustl.edu
#BSUB -o transfer-%J.txt
#BSUB -J "transfer"

cd /my-projects
# Quote the remote path so the local shell does not try to expand the wildcard.
sup shiftc 'pfe:/nobackup/dzhang8/GEOSChem.ACAG.20180101*.nc4' .
# To transfer a directory containing many large files, try sup shiftc --hosts=8 --sync -r
#   --sync skips files that already exist unchanged at the destination
#   -r transfers directories recursively
#   --hosts=8 parallelizes the transfer across up to 8 hosts
# Keep the job alive until the transfer finishes, then kill the job manually.
tail -f /dev/null
Note
Transferring data (restarts and outputs from GCHP) from pfe to lfe helps reduce the amount of storage we use on pfe.
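A minimal sketch of such a pfe-to-lfe transfer, run on pfe. The function name archive_to_lfe and the DRY_RUN switch are hypothetical; the sketch assumes shiftc accepts an lfe: destination whose path is resolved relative to your Lou home directory, and it reuses the --sync and -r options described in the batch example.

```shell
#!/bin/bash
# Hypothetical helper (run on pfe): copy a GCHP output or restart directory
# from /nobackup to your Lou home on lfe with shiftc.
# Assumption: "lfe:<path>" resolves <path> relative to your Lou home.
# Set DRY_RUN=1 to print the shiftc command instead of running it.
archive_to_lfe() {
    local src="$1"   # e.g. /nobackup/<your_username>/gchp_run/OutputDir
    local dest="$2"  # destination directory under your Lou home
    if [ -z "$src" ] || [ -z "$dest" ]; then
        echo "usage: archive_to_lfe <source_dir> <dest_dir_on_lfe>" >&2
        return 1
    fi
    # --sync skips files already present unchanged; -r recurses into dirs.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo shiftc --sync -r "$src" "lfe:$dest/"
    else
        shiftc --sync -r "$src" "lfe:$dest/"
    fi
}
```

For example, DRY_RUN=1 lets you check the command before archiving /nobackup/<your_username>/gchp_run/OutputDir into gchp_run on Lou; drop it to run the transfer, and only then clean up /nobackup.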
Running GCHP on Pleiades
GCHP environment: source the environment script with source /u/yzhang52/gchp-intel.202304.env to compile or run GCHP (compilation should be done on a compute node).
An example run script can be found at /u/yzhang52/gchp.run.pbs.
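For orientation only, the header of a PBS job script might look like the sketch below. The queue, node model, node count, and walltime are illustrative assumptions, and the actual GCHP launch line should be copied from /u/yzhang52/gchp.run.pbs. Broadwell (model=bro) nodes have 28 cores; here 24 MPI ranks per node are requested so that the total (48) is a multiple of 6, as GCHP's cubed-sphere decomposition requires. The core count must also match the settings in your GCHP run directory.

```shell
#!/bin/bash
#PBS -S /bin/bash
#PBS -N gchp
#PBS -q normal
# 2 Broadwell nodes, 24 MPI ranks each (48 total, a multiple of 6).
#PBS -l select=2:ncpus=28:mpiprocs=24:model=bro
#PBS -l walltime=08:00:00
#PBS -W group_list=<your-project-id>
#PBS -j oe

# Run from the directory the job was submitted from (the GCHP run directory).
cd "$PBS_O_WORKDIR" || exit 1

# Load the same environment used to compile GCHP.
source /u/yzhang52/gchp-intel.202304.env

# Launch GCHP here with the MPI launcher line from /u/yzhang52/gchp.run.pbs.
```

Submit it from pfe with qsub <script_name>.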
Note
Set #PBS -W group_list=<your-project-id> in your job script. Your project ID and usage are shown by acct_ytd.
The NASA Pleiades system uses PBS for job scheduling. Commonly used PBS commands are listed at PBS Commands.
Real-time usage of the different clusters on NASA Pleiades can be monitored at NASA System Status (note that the website takes several minutes to update).
Note
Another way to check the real-time availability of the different node types is node_stats.sh (already in your PATH).
An example of node_stats.sh output:
Nodes currently allocated to the devel queue:
bro : Intel Broadwell Total: 110, Used: 65, Free: 45
cas_ait : Intel Cascadelake Total: 64, Used: 11, Free: 53
has : Intel Haswell Total: 145, Used: 96, Free: 49
ivy : Intel Ivybridge Total: 406, Used: 303, Free: 103
rom_ait : AMD Rome Total: 69, Used: 69, Free: 0
sky_ele : Intel Skylake Total: 20, Used: 10, Free: 10
SBU rate per node type: bro:1.0 bro_ele:1.0 cas_ait:1.64 cas_gpu:27.04 has:0.8 ivy:0.66 mil_a100:37.86 mil_ait:4.38 rom_ait:4.06 rom_gpu:75.72 sky_ele:1.59 sky_gpu:27.04
FY2024 SBU cost == $0.22/SBU
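To narrow that output down to the Intel node types recommended for GCHP, a small awk filter can be piped after node_stats.sh. This is a sketch: free_nodes is a hypothetical name, and it assumes each node line keeps the "<type> : <description> Total: N, Used: N, Free: N" layout shown in the sample.

```shell
#!/bin/bash
# Hypothetical filter: read node_stats.sh output on stdin and print
# "<type> <free nodes>" for the Intel node types suggested for GCHP,
# sorted by the number of free nodes (largest first).
free_nodes() {
    awk '$1 ~ /^(bro|cas_ait|has|sky_ele)$/ && /Free:/ { print $1, $NF }' |
        sort -k2,2nr
}
```

Usage: node_stats.sh | free_nodes. Applied to the sample output, it prints cas_ait 53, has 49, bro 45, and sky_ele 10, one per line.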
Intel-processor nodes bro, cas_ait, has, and sky_ele are the top choices for GCHP simulations. Detailed descriptions (such as core counts per node) can be found at NASA Node Types, in the "PBS on <Cluster>" section.
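A rough cost estimate follows from the SBU rates in the node_stats.sh output: the SBUs charged for a job are the number of nodes times the wall-clock hours times the SBU rate of the node type. The sketch below applies that formula with the FY2024 price of $0.22/SBU as its default; sbu_cost is a hypothetical helper, and actual charges follow NAS accounting.

```shell
#!/bin/bash
# Hypothetical helper: estimate the SBUs and dollar cost of a PBS job.
#   SBUs = nodes x wall-clock hours x SBU rate of the node type
# Arguments: nodes, hours, SBU rate, and optionally $/SBU (default 0.22,
# the FY2024 value reported by node_stats.sh).
sbu_cost() {
    awk -v n="$1" -v h="$2" -v r="$3" -v c="${4:-0.22}" 'BEGIN {
        sbu = n * h * r
        printf "%.2f SBU, $%.2f\n", sbu, sbu * c
    }'
}
```

For example, sbu_cost 2 24 1.0 (two bro nodes for 24 hours) prints 48.00 SBU, $10.56. Since node types differ in core count (28 on bro, 24 on has, 40 on sky_ele and cas_ait), dividing the SBU rate by the cores per node gives a fairer per-core comparison.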
Model inputs
/ExtData
Note
There is no shared /ExtData like the one on Compute1, but some inputs were downloaded as follows:
Sebastian has downloaded multiple required inputs at /nobackup/seastham/ExtData/ (no longer available)
Dandan has downloaded required inputs for simulations in 2018 and 2019 at /nobackup/dzhang8/ExtData/ (no longer available)
You have to download the inputs you need (through AWS or the WashU data portal) or transfer them with shiftc to /nobackup/<your_username>/ExtData/ before running GCHP.
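Before downloading or transferring anything, it helps to check which inputs are already in place. Assuming you have a plain-text list of required files with paths relative to ExtData (how you compile that list depends on your simulation), the hypothetical helper below prints only the entries missing under /nobackup/<your_username>/ExtData/, which defaults to /nobackup/$USER/ExtData.

```shell
#!/bin/bash
# Hypothetical helper: print required inputs that are NOT yet present.
# Reads one path per line (relative to the ExtData root) from stdin.
# Usage: missing_inputs [ExtData_root] < needed_files.txt
missing_inputs() {
    local root="${1:-/nobackup/$USER/ExtData}"
    local rel
    # "|| [ -n "$rel" ]" also handles a final line without a newline.
    while IFS= read -r rel || [ -n "$rel" ]; do
        [ -z "$rel" ] && continue              # skip blank lines
        [ -e "$root/$rel" ] || printf '%s\n' "$rel"
    done
}
```

Only the printed files then need to be fetched through AWS, the WashU data portal, or shiftc.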
Processing outputs on Pleiades
Data analysis nodes: the Lou Data Analysis Nodes (LDAN) can be used for postprocessing data (e.g. GCHP diagnostics).
Python environment: source the environment script with source /u/yzhang52/python-gchp.env.
Bring data back to disk (e.g. with dmget) before processing it on lfe, to avoid I/O stalling for an unpredictable amount of time while files are retrieved from tape (see bring data to disk).
Note
To save space on Pleiades, first check whether the inputs you need are already available, and only download the inputs you need.