Practical Example: Python+TensorFlow
Notes on installing and testing TensorFlow on the YCRC clusters Grace & McCleary
Prerequisites
- Basic understanding of command line shell commands and syntax
- A Yale NetID
- An active Yale VPN connection
- A YCRC Cluster account
Setup
- Log in to OOD for the cluster you have an account on with your NetID and Password.
- Start a new interactive shell: "Shell" > "Clustername Shell Access"
- Start an interactive job.

      salloc -c 2

  | salloc Argument | Description |
  | --------------- | ----------- |
  | -c 2 | Allocate two CPU cores (and default 5GiB RAM per core) |
- Create the Conda environment for TensorFlow and index it for the YCRC Open OnDemand web dashboard.

      module load miniconda
      mamba create --name tf2 python=3.9 cudatoolkit=11.2.2 cudnn=8.1.0 jupyter jupyterlab
      # environment creation will take a few minutes
      conda activate tf2
      # do not install tensorflow with conda; it is no longer supported
      pip install tensorflow==2.11.*
      # configure system paths to point to the conda environment's libraries
      mkdir -p $CONDA_PREFIX/etc/conda/activate.d
      echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
      # run this command to add the environment to the list of Jupyter notebook environments
      ycrc_conda_env.list build
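As a quick sanity check (a minimal sketch, assuming the tf2 environment created above is active in your interactive job), you can confirm that TensorFlow imports and reports its devices before moving on to the notebook job:

```python
# sanity check for the freshly created tf2 environment
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
# in the CPU-only interactive job this list will be empty;
# on a GPU node it should show at least one device
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```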
Start a Notebook
- Log in to OOD for the cluster you have an account on with your NetID and Password.
- Navigate to the Jupyter job submission form: "Interactive Apps" > "Jupyter Notebook"
- Set up the request.

  GPUs and their needs: not all partitions have GPUs; they must be requested explicitly, and you need to load the right version of CUDA + cuDNN. To make this environment work properly in your Jupyter job you need to:

    - Choose your tf2 environment from the "Environment Setup" dropdown
    - Set "Number of GPUs per node" to at least 1
    - Set a partition with GPUs available

  Then click the Launch button. Depending on how busy the cluster is, your job may need to wait in the queue.
- Connect to the Jupyter Server.
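Once connected, you can confirm from a notebook cell that the job's GPU is visible and that CUDA/cuDNN initialize correctly. This is a minimal sketch, assuming the job was launched with at least one GPU as described above:

```python
# run in a notebook cell using the tf2 environment
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)

# a small matrix multiplication forces CUDA/cuDNN initialization,
# so problems with the GPU setup show up here rather than mid-training
with tf.device("/GPU:0" if gpus else "/CPU:0"):
    x = tf.random.uniform((1000, 1000))
    print("Result computed on:", (x @ x).device)
```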
Containers - tensorflow/serving
You can use TensorFlow Serving to serve multiple versions of your models, but the examples given use Ubuntu packages or Docker. YCRC clusters run RHEL and do not allow Docker, so instead we use Apptainer. Where the Docker-based examples would run:
docker pull tensorflow/serving
export ML_PATH=$HOME/ml # or wherever this project is
docker run -it --rm -p 8500:8500 -p 8501:8501 \
-v "$ML_PATH/my_mnist_model:/models/my_mnist_model" \
-e MODEL_NAME=my_mnist_model \
tensorflow/serving
the equivalent with Apptainer is:

apptainer build tf_model_server.sif docker://tensorflow/serving
export ML_PATH=$HOME/repos/handson-ml2
apptainer run --containall -B "$ML_PATH/my_mnist_model:/models/my_mnist_model" \
--env MODEL_NAME=my_mnist_model tf_model_server.sif
- You only need to build the .sif file when you would run docker pull
- You don't need to forward ports out of the container
- --containall makes apptainer run more like Docker and allows finer control of what files the container sees
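Because the container shares the host network, you can test the running server from another shell on the same node. The snippet below is a minimal sketch, assuming TensorFlow Serving's default REST port (8501), the my_mnist_model name used above, and a model signature that accepts batches of 28x28 inputs; adjust the payload to match your model's actual signature:

```python
# send a REST predict request to the model server running on this node
import json
import urllib.request

import numpy as np

payload = json.dumps({"instances": np.zeros((1, 28, 28)).tolist()})
request = urllib.request.Request(
    "http://localhost:8501/v1/models/my_mnist_model:predict",
    data=payload.encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["predictions"])
```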
Heterogeneous Job Layouts
If you want to allocate multiple node types in a single job, e.g. a larger first task/worker for a TensorFlow "chief" or a CPU-only parameter server, you can split the request into heterogeneous job components, as in the two batch scripts below (a sketch of what cluster_tf.py might look like follows them). The first script requests a chief with four CPUs and one GPU, plus eight worker tasks with one GPU each:
#!/bin/bash
#SBATCH --partition=gpu --time=1- --ntasks=1 --cpus-per-task=4
#SBATCH --mem-per-cpu=8G --gpus-per-task=1 --job-name=tf2-cluster-chief
#SBATCH hetjob
#SBATCH --ntasks=8 --cpus-per-task=2
#SBATCH --mem-per-cpu=16G --gpus-per-task=1
module purge
module load miniconda
conda activate tf2
# this python process will start on a node in the gpu partition with 4 CPUs & 1 GPU
python cluster_tf.py
The second script requests a CPU-only parameter server in the day partition, plus eight GPU workers in the gpu partition:

#!/bin/bash
#SBATCH --partition=day --time=1- --ntasks=1 --cpus-per-task=2
#SBATCH --mem-per-cpu=8G --job-name=tf2-cluster-par-serv
#SBATCH hetjob
#SBATCH --partition=gpu --ntasks=8 --cpus-per-task=2
#SBATCH --mem-per-cpu=16G --gpus-per-task=1
module purge
module load miniconda
conda activate tf2
# this python process will start on a node in the day partition with 2 CPUs
python cluster_tf.py
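The batch scripts above launch the same cluster_tf.py on every task, but its contents are not part of this example. The sketch below is one hypothetical way to write it for the multi-worker case, using tf.distribute's Slurm-aware cluster resolver; a parameter-server layout would instead use tf.distribute.experimental.ParameterServerStrategy, and the model here is only a placeholder:

```python
# hypothetical cluster_tf.py: every Slurm task runs this same script and the
# cluster resolver works out each task's role from the SLURM_* environment
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.SlurmClusterResolver()
strategy = tf.distribute.MultiWorkerMirroredStrategy(cluster_resolver=resolver)

with strategy.scope():
    # placeholder model; replace with the network you actually want to train
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# every worker executes the same fit() call; the strategy shards the batches
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, epochs=2, batch_size=256)
```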