Job Performance Monitoring
We have recently deployed a new tool for measuring and monitoring job performance called jobstats.
Available on all clusters, jobstats provides a report of the utilization of CPU, memory, and GPU resources for in-progress and recently completed jobs.
To generate the report, run jobstats followed by the ID number of the job in question:
[ab123@grace ~]$ jobstats 123456789
================================================================================
Slurm Job Statistics
================================================================================
Job ID: 123456789
NetID/Account: ab123/group
Job Name: gpu_job
State: RUNNING
Nodes: 1
CPU Cores: 1
CPU Memory: 5GB
GPUs: 1
QOS/Partition: normal/gpu
Cluster: grace
Start Time: Tue Nov 26, 2024 at 2:10 PM
Run Time: 20:09:56 (in progress)
Time Limit: 2-00:00:00
Overall Utilization
================================================================================
CPU utilization [||||||||||||||||||||||||||||||||||||||||||||||100%]
CPU memory usage [ 1%]
GPU utilization [|||||||||||||||||||||||||||||||||||||||||||||||98%]
GPU memory usage [| 3%]
Detailed Utilization
================================================================================
CPU utilization per node (CPU time used/run time)
r808u11n01: 20:07:01/20:09:56 (efficiency=99.8%)
CPU memory usage per node - used/allocated
r808u11n01: 32.4MB/5.0GB (32.4MB/5.0GB per core of 1)
GPU utilization per node
r808u11n01 (GPU 1): 98.3%
GPU memory usage per node - maximum used/total
r808u11n01 (GPU 1): 689.6MB/24.0GB (2.8%)
Notes
================================================================================
* Have a nice day!
When viewed in a web browser, these statistics are enhanced with plots of performance over time.
The plots are a convenient way to monitor how your job's behavior and resource utilization change as it runs.
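As a rough illustration of how the percentages in the sample report can be read, the following Python sketch recomputes two of them from the raw values shown in the Detailed Utilization section. It is not part of jobstats: the helper names (`to_seconds`, `cpu_efficiency`, `to_megabytes`, `memory_percent`) are hypothetical, and the formulas are assumptions that happen to reproduce the figures in the report, namely that efficiency is CPU time divided by run time times the number of cores, and that 1 GB is counted as 1024 MB.

```python
# Hypothetical helpers for interpreting a jobstats report; not part of jobstats.

# Assumption: binary units (1 GB = 1024 MB), which reproduces the 2.8% in the report.
UNITS_IN_MB = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0, "TB": 1024.0 ** 2}


def to_seconds(clock: str) -> int:
    """Convert a Slurm-style duration ([D-]HH:MM:SS) to seconds."""
    days = 0
    if "-" in clock:
        day_part, clock = clock.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_efficiency(cpu_time: str, run_time: str, cores: int = 1) -> float:
    """CPU time used as a percentage of run time across the allocated cores.

    Assumption: efficiency = CPU time / (run time * cores).
    """
    return 100.0 * to_seconds(cpu_time) / (to_seconds(run_time) * cores)


def to_megabytes(size: str) -> float:
    """Convert a size such as '689.6MB' or '24.0GB' to megabytes."""
    number, unit = size[:-2], size[-2:].upper()
    return float(number) * UNITS_IN_MB[unit]


def memory_percent(used: str, total: str) -> float:
    """Memory used as a percentage of the memory available."""
    return 100.0 * to_megabytes(used) / to_megabytes(total)


if __name__ == "__main__":
    # Values copied from the sample report for node r808u11n01.
    print(f"CPU efficiency: {cpu_efficiency('20:07:01', '20:09:56'):.1f}%")
    print(f"GPU memory:     {memory_percent('689.6MB', '24.0GB'):.1f}%")
    print(f"CPU memory:     {memory_percent('32.4MB', '5.0GB'):.1f}%")
```

In the sample job, CPU and GPU utilization are both high while well under 1% of the 5 GB of CPU memory is in use, so a similar job could likely request less memory.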
Last update: November 27, 2024