ClusterShell is a useful Python package for executing arbitrary commands across multiple hosts. On the Yale clusters it provides a relatively simple way for you to run commands on the nodes your jobs are running on and collect the results. The two most useful commands it provides are:

- `nodeset`, which can show and manipulate node lists
- `clush`, which can run commands on multiple nodes at once
To set up ClusterShell on our clusters, download our groups configuration file, which tells ClusterShell how to resolve the Slurm-based node groups used in the examples below:

```
mkdir -p ~/.config/clustershell
wget https://docs.ycrc.yale.edu/files/clustershell_groups.conf -O ~/.config/clustershell/groups.conf
```
We provide ClusterShell as a module:

```
module load ClusterShell
```

You can also install it yourself with conda and pip:

```
module load miniconda
conda create -yn clustershell python pip
source activate clustershell
pip install ClusterShell
```
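Either way, a quick sanity check that the tools are available (version numbers will vary):

```
# confirm both commands are on your PATH and report the installed version
command -v nodeset clush
clush --version
```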
The `nodeset` command uses `sinfo` under the hood but has slightly different syntax. You can use it to ask about node states and the nodes your jobs are running on. A nice difference is that you can ask for folded (e.g. `c[01-02]n[12,15,18]`) or expanded (e.g. `c01n01 c01n02 ...`) node lists. The groups we have configured that will be useful to you are `@user`, `@job`, and `@state`, demonstrated in the examples below.
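First, a quick illustration of converting between the folded and expanded forms (the node names here are made up):

```
# expand a folded node list into individual node names
nodeset -e c[01-02]n[12,15]
# should print: c01n12 c01n15 c02n12 c02n15

# fold a list of node names back into the compact form
nodeset -f c01n12 c01n15 c02n12 c02n15
# should print: c[01-02]n[12,15]
```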
List expanded node names where user abc123 has jobs running:

```
# similar to squeue -h -u abc123 -o "%N"
nodeset -e @user:abc123
```
List folded node names where job 1234567 is running:

```
# similar to squeue -h -j 1234567 -o "%N"
nodeset -f @job:1234567
```
List expanded node names that are idle according to Slurm:

```
# similar to sinfo -t IDLE -o "%N"
nodeset -e @state:idle
```
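You can also count group members or combine groups with set operations; a small sketch using the same placeholder IDs as above:

```
# count idle nodes instead of listing them
nodeset -c @state:idle

# nodes where abc123 has jobs, excluding the nodes used by job 1234567
nodeset -f @user:abc123 -x @job:1234567
```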
The `clush` command uses the node grouping syntax from `nodeset` to let you run commands on those nodes. `clush` uses `ssh` to connect to each of these nodes, so you can only run `clush` on nodes where you have active jobs. You can use the `-b` option to gather identical output from multiple nodes into a single entry; leaving it out reports on each node separately.
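To see what `-b` does, here is roughly what gathered output looks like; the job ID, node names, and output are all made up for illustration:

```
$ clush -bw @job:654321 echo ok
---------------
c[01-02]n03 (2)
---------------
ok
```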
Get a list of files in /tmp/abc123 on all nodes where job 654321 is running:

```
clush -bw @job:654321 ls /tmp/abc123

# don't gather identical output
clush -w @job:654321 ls /tmp/abc123
```
Show %CPU, memory usage (RSS), and command for every process owned by user abc123 on all nodes where they have jobs running:

```
clush -bw @user:abc123 ps -uabc123 -o%cpu,rss,cmd
```
Show what's running on all the GPUs on the nodes associated with your job:

```
clush -bw @job:654321 nvidia-smi --format=csv --query-compute-apps=process_name,used_gpu_memory
```
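`clush` can also distribute files to the nodes in a group with its copy mode. A brief sketch, where the job ID and file name are placeholders:

```
# copy input.dat to /tmp on every node where job 654321 is running
clush -w @job:654321 --copy input.dat --dest /tmp/
```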