Jobs with Dependencies
SLURM offers a tool which can help string jobs together via dependencies. When submitting a job, you can specify that it should wait to run until a specified job has finished. This provides a mechanism to create simple pipelines for managing complicated workflows.
As a toy example, consider a two-step pipeline, first a data transfer followed by an analysis step.
Here we will use the
--dependency flag for sbatch and the
afterok type that requires a job to finish successfully before starting the second step:
The first step is controlled by a
sbatch submission script called
#!/bin/bash #SBATCH --job-name=DataTransfer #SBATCH -t 30:00 rsync -avP remote_host:/path/to/data.csv $HOME/project/
#!/bin/bash #SBATCH --job-name=DataProcess #SBATCH -t 5:00:00 module load miniconda source activate my_env python my_script.py $HOME/project/data.csv
sbatch step1.sh) we obtain the jobid number for that job. We then submit the second step adding in the
--dependencyflag to tell Slurm that this job requires the first job to finish before it can start:
sbatch --dependency=afterok:56761133 step2.sh
When the 'transfer' job finishes successfully (without an error exit code) the 'processing' step will begin. While this is a simple dependency structure, it is possible to have multiple dependencies or more complicated structure.
One frequent use-case is a clean-up job that runs after all other jobs have finished.
This is a common way to collect results from processing multiple files into a single output file.
This can be done using the
--dependency=singleton:<job_id> flag that will wait until all previously launched jobs with the same name and user have finished.
[tl397@grace1 ~]$ squeue -u tl397 JOBID PARTITION NAME USER ST SUBMIT_TIME NODELIST(REASON) 12345670 day JobName tl397 R 2020-05-27T11:54 c01n08 12345671 day JobName tl397 R 2020-05-27T11:54 c01n08 ... 12345678 day JobName tl397 R 2020-05-27T11:54 c01n08 12345679 day JobName tl397 R 2020-05-27T11:54 c01n08 [tl397@grace1 ~]$ sbatch --dependency=singleton --job-name=JobName cleanup.sh [tl397@grace1 ~]$ squeue -u tl397 JOBID PARTITION NAME USER ST SUBMIT_TIME NODELIST(REASON) 12345670 day JobName tl397 R 2020-05-27T11:54 c01n08 12345671 day JobName tl397 R 2020-05-27T11:54 c01n08 ... 12345678 day JobName tl397 R 2020-05-27T11:54 c01n08 12345679 day JobName tl397 R 2020-05-27T11:54 c01n08 12345680 day JobName tl397 R 2020-05-27T11:54 (Dependency)
This last job will wait to run until all previous jobs with name
SLURM provides a number of options for logic controlling dependencies.
Most common are the two discussed above, but
--dependency=afternotok:<job_id> can be useful to control behavior if a job fails.
Full discussion of the options can be found on the SLURM manual page for
A very detailed overview, with examples in both bash and python, can also be found at the NIH computing reference: https://hpc.nih.gov/docs/job_dependencies.html.