Pass Values into Jobs
A useful tool when running jobs on the clusters is the ability to pass values into a script without modifying any code. This can include specifying the name of a data file to be processed or setting a variable to a specific value. Generally, there are two ways of achieving this: environment variables and command-line arguments. Here we will work through how to implement both approaches in Python and R.
Python
Environment Variables
In Python, environment variables are accessed via the os
module (docs page).
In particular, we can use os.getenv
to retrieve environment variables set prior to launching the Python script.
For example, consider a Python script designed to process a data file:
def file_cruncher(file_name):
    with open(file_name) as f:
        data = f.read()
    # processing code goes here
    output = process(data)
    return output
We can use an environment variable (INPUT_DATA_FILE
) to provide the filename of the data to be processed.
The Python script (my_script.py
) is modified to retrieve this variable and analyze the given datafile:
import os

file_name = os.getenv("INPUT_DATA_FILE")

def file_cruncher(file_name):
    with open(file_name) as f:
        data = f.read()
    # processing code goes here
    output = process(data)
    return output
To process this data file, you would simply run:
export INPUT_DATA_FILE=/path/to/file/input_0.dat
python my_script.py
This avoids having to modify the Python script to change which datafile is processed; we only need to change the environment variable.
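Note that os.getenv returns None when a variable is unset and always returns a string otherwise. A minimal sketch of supplying a default value and casting a numeric variable (the variable names here are illustrative, not part of the script above):

```python
import os

# Fall back to a default path if the variable is unset (illustrative name)
file_name = os.getenv("INPUT_DATA_FILE", "input_0.dat")

# Environment variables are always strings; cast numbers explicitly
n_iterations = int(os.getenv("N_ITERATIONS", "10"))

print(file_name, n_iterations)
```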
Command-line Arguments
Similarly, one can use command-line arguments to pass values into a script.
In Python, there are two main tools for handling arguments.
First is the simple sys.argv
list, which contains the command-line arguments as a list of strings:
import sys

for a in sys.argv:
    print(a)
Running this with a few arguments:
$ python my_script.py a b c
my_script.py
a
b
c
The first element of sys.argv
is the name of the script, followed by all subsequent arguments.
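For example, the data-file script above could take its input file as the first command-line argument instead of an environment variable. A sketch, wrapped in a function so the argument handling is easy to see:

```python
import sys

def get_input_file(argv):
    """Return the input filename passed as the first command-line argument."""
    # argv[0] is the script name; real arguments start at index 1
    if len(argv) < 2:
        raise SystemExit("usage: my_script.py <input_file>")
    return argv[1]

# In a real run this would be get_input_file(sys.argv)
file_name = get_input_file(["my_script.py", "input_0.dat"])
print(file_name)
```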
Secondly, there is the more fully-featured argparse
package (docs page), which offers many advanced tools for managing command-line arguments.
Take a look at their documentation for examples of how to use argparse
.
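As a brief illustration, an argparse version of the same argument handling might look like this (the argument names are illustrative, not from the original script):

```python
import argparse

parser = argparse.ArgumentParser(description="Process a data file")
parser.add_argument("input_file", help="path to the data file")
parser.add_argument("--iterations", type=int, default=10,
                    help="number of processing iterations")

# In a real script this would be parser.parse_args(), reading sys.argv
args = parser.parse_args(["input_0.dat", "--iterations", "5"])
print(args.input_file, args.iterations)
```

argparse automatically generates a --help message and validates types (here, --iterations is cast to an integer for you).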
R
Just as with Python, R provides comparable utilities to access command-line arguments and environment variables.
Environment Variables
The Sys.getenv
utility (docs page) works nearly identically to the Python implementation.
> Sys.getenv('HOSTNAME')
[1] "grace2.grace.hpc.yale.internal"
Just like in Python, these values are always returned as strings, so if the variable of interest is a number it will need to be cast to a numeric type using as.numeric()
.
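For example, a short sketch (the variable name N_ITERATIONS is illustrative; the unset argument supplies a default when the variable is not defined):

```r
# Environment variables arrive as strings; cast before doing arithmetic
n_iter <- as.numeric(Sys.getenv("N_ITERATIONS", unset = "10"))
print(n_iter)
```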
Command-line Arguments
To collect command-line arguments in R use the commandArgs
function:
args = commandArgs(trailingOnly=TRUE)
for (x in args){
    print(x)
}
The trailingOnly=TRUE
option will limit args
to contain only those arguments which follow the script:
$ Rscript my_script.R a b c
[1] "a"
[1] "b"
[1] "c"
There is a more advanced and detailed package for managing command-line arguments called optparse
(docs page).
This can be used to create more featured scripts in a similar way to Python's argparse
.
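A minimal optparse sketch might look like the following (the option name is illustrative, and the package must first be installed from CRAN):

```r
library(optparse)

# Define an illustrative --input option with a default value
option_list <- list(
    make_option("--input", type="character", default="input_0.dat",
                help="path to the data file")
)
opts <- parse_args(OptionParser(option_list=option_list))
print(opts$input)
```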
Slurm Environment Variables
Slurm sets a number of environment variables detailing the layout of every job. These include:
SLURM_JOB_ID
: the unique job ID given to each job. Useful for setting unique output directories.
SLURM_CPUS_PER_TASK
: the number of CPUs allocated for each task. Useful as a replacement for R's detectCores
or Python's multiprocessing.cpu_count
, which report the physical number of CPUs and not the number allocated by Slurm.
SLURM_ARRAY_TASK_ID
: the unique array index for each element of a job array (for a specific example, see here). Useful to un-roll a loop or to set a unique random seed for parallel simulations.
These variables can be leveraged within batch scripts using the techniques above, either passed on the command line or read directly from the environment, to control how a script runs.
For example, if a script previously looped over values ranging from 0-9, we can modify the script and create a job array that runs each iteration separately in parallel, using SLURM_ARRAY_TASK_ID
to tell each element of the job array which value to use.
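A sketch of what that un-rolled script could look like in Python (the parameter list is illustrative, and a fallback default lets the script also run outside Slurm):

```python
import os

# Values the original loop iterated over (illustrative)
values = list(range(10))

# Each job array element gets its own index; default to 0 outside Slurm
task_id = int(os.getenv("SLURM_ARRAY_TASK_ID", "0"))

value = values[task_id]
print(f"Task {task_id} processing value {value}")
```

Submitted with an array directive such as `#SBATCH --array=0-9`, each of the ten array elements processes one value in parallel.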