ClusterShell
ClusterShell is a useful Python package for executing arbitrary commands across multiple hosts. On the Yale clusters it provides a relatively simple way for you to run commands on nodes your jobs are running on, and collect the results. The two most useful commands provided are nodeset, which can show and manipulate node lists, and clush, which can run commands on multiple nodes at once.
Configuration
To set up ClusterShell, make sure you have a ~/.config/clustershell directory and a copy of our groups.conf file there. For more info about ClusterShell configuration for Slurm, see the official docs.
mkdir -p ~/.config/clustershell
wget https://docs.ycrc.yale.edu/_static/files/clustershell_groups.conf -O ~/.config/clustershell/groups.conf
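To check that the configuration is being picked up, you can ask nodeset to list the group sources it knows about; the bindings from groups.conf should appear in the output (the exact output depends on your configuration).
# list the group sources ClusterShell has loaded
nodeset --groupsources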
We provide ClusterShell as a module, but you can also install it with conda.
Module
module load ClusterShell
Conda
module load miniconda
conda create -yn clustershell python pip
source activate clustershell
pip install ClusterShell
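As a quick sanity check after either install method, confirm the command-line tools are on your path; these should print the installed version.
nodeset --version
clush --version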
Examples
nodeset
The nodeset command uses sinfo underneath but has slightly different syntax. You can use it to ask about node states and nodes your job is running on. The nice difference is you can ask for folded (e.g. c[01-02]n[12,15,18]) or expanded (e.g. c01n01 c01n02 ...) node lists. The groups useful to you that we have configured are @user, @job and @state.
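nodeset can also fold and expand arbitrary node lists independent of Slurm, which is handy for building host lists by hand. The node names below are just placeholders:
# fold an explicit list of nodes into a compact range
nodeset -f c01n12 c01n15 c01n18
# expand a folded list back into individual node names
nodeset -e c01n[12,15,18]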
User group
List expanded node names where user abc123 has jobs running
# similar to squeue -h -u abc123 -o "%N"
nodeset -e @user:abc123
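If you only want to know how many nodes your jobs are spread across, the -c option counts the nodes instead of listing them:
# count nodes where user abc123 has jobs running
nodeset -c @user:abc123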
Job group
List folded nodes where job 1234567 is running
# similar to squeue -h -j 1234567 -o "%N"
nodeset -f @job:1234567
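Since the expanded form is just a space-separated list, you can feed it to other commands. For example, a sketch of connecting to the first node of a job (the job ID is a placeholder):
# ssh to the first node allocated to job 1234567
ssh $(nodeset -e @job:1234567 | awk '{print $1}')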
State group
List expanded node names that are idle according to slurm
# similar to sinfo -t IDLE -o "%N"
nodeset -e @state:idle
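Other Slurm states work the same way, assuming the state name is one sinfo accepts (e.g. alloc, mix, drain):
# folded list of fully allocated nodes
nodeset -f @state:alloc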
clush
The clush command uses the node grouping syntax from nodeset to allow you to run commands on those nodes. clush uses ssh to connect to each of these nodes. You can use the -b option to gather identical output from multiple nodes onto the same lines. Leaving this out will report on each node separately.
Info
You can only ssh to, and therefore run clush on, nodes where you have active jobs.
Local storage
Get a list of files in /tmp/abc123 on all nodes where job 654321 is running.
clush -bw @job:654321 ls /tmp/abc123
# don't gather identical output
clush -w @job:654321 ls /tmp/abc123
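In the same spirit, you can check how much local scratch space is free on each of your job's nodes (the path shown is just an example):
# free space on local disk for all nodes of job 654321
clush -bw @job:654321 df -h /tmp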
CPU usage
Show %cpu, memory usage, and command for all nodes running any jobs owned by user abc123.
clush -bw @user:abc123 ps -uabc123 -o%cpu,rss,cmd
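If a node is running many processes, you may only want the heaviest ones. Quoting the command keeps the pipe on the remote side; this sketch assumes GNU ps on the compute nodes:
# show only the first few lines of ps output per node, sorted by CPU usage
clush -bw @user:abc123 'ps -uabc123 -o%cpu,rss,cmd --sort=-%cpu | head -n 5'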
GPU usage
Show what's running on all the GPUs on the nodes associated with your job 654321.
clush -bw @job:654321 nvidia-smi --format=csv --query-compute-apps=process_name,used_gpu_memory
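For an overall picture rather than per-process detail, you can query GPU utilization and memory use instead (same idea, different nvidia-smi query):
# per-GPU utilization and memory use on all nodes of job 654321
clush -bw @job:654321 nvidia-smi --format=csv --query-gpu=index,utilization.gpu,memory.used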