Practical Example: Python+TensorFlow

Notes on handson-ml2 for YCRC clusters Grace & Farnam


  • Basic understanding of command line shell commands and syntax
  • A Yale NetID
  • An active Yale VPN connection
  • A YCRC Cluster account


  1. Log in to Open OnDemand (OOD) for the cluster where you have an account, using your NetID and password.

  2. Start a new interactive shell.

    "Shell" > "Clustername Shell Access"

  3. Start an interactive job.

    salloc -c 2 -p interactive
    salloc Argument    Description
    -c 2               Allocate two CPU cores (with the default 5 GiB of RAM per core)
    -p interactive     Start the job on the interactive partition (meant for work like this)

  4. Clone the repo from GitHub.

    mkdir -p ~/repos
    cd ~/repos
    git clone
  5. Create the Conda environment used in the book, then index it for the YCRC Open OnDemand web dashboard.

    module load miniconda
    cd ~/repos/handson-ml2
    # environment creation will take a few minutes
    conda env create -f environment.yml
    conda activate tf2
    # swap out tf with gpu-enabled tf
    pip uninstall tensorflow
    pip install tensorflow-gpu==2.4.1
    ycrc_conda_env.list build

Start a Notebook

  1. Log in to Open OnDemand (OOD) for the cluster where you have an account, using your NetID and password.

  2. Navigate to the Jupyter job submission form.

    "Interactive Apps" > "Jupyter Notebook"

  3. Set up the request.

    GPUs and their needs

    Not all partitions have GPUs. GPUs must be requested explicitly, and you need to load the right version of CUDA + cuDNN.

    To make this environment work properly in your Jupyter job, you need to:

    • Choose your tf2 environment from the "Environment Setup" dropdown

    • Set "Number of GPUs per node" to at least 1.

    • Set a partition with GPUs available.

    • Load the right cuDNN module (will also load a CUDA module). For this environment, that's cuDNN/ .

      What version of CUDA?

      If you are not installing things for yourself or don't know what CUDA version you need, try importing GPU-enabled tensorflow and look at the output:

      Python 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37)
      [GCC 9.3.0] on linux
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import tensorflow as tf
      2021-03-26 08:50:40.092738: W tensorflow/stream_executor/platform/default/] Could not load dynamic library''; dlerror: cannot open shared object file: No such file or directory
      2021-03-26 08:50:40.092780: I tensorflow/stream_executor/cuda/] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
      Notice that it complains about not finding a shared library. The numbers at the end of that shared object's name usually correspond to a CUDA version, so here we will use CUDA 11.

    Then click the Launch button. Depending on how busy the cluster is, your job may need to wait in the queue.

  4. Connect to the Jupyter Server.
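Once the notebook is open, a quick cell can confirm that the kernel is using the GPU-enabled TensorFlow build and show which CUDA and cuDNN versions it expects. The following is a minimal sketch, assuming TensorFlow 2.3 or newer (where `tf.sysconfig.get_build_info()` is available). The helper name `tf_gpu_report` is made up for this example, and it returns `None` when TensorFlow cannot be imported.

```python
def tf_gpu_report():
    """Summarize the TensorFlow build and visible GPUs, or return None without TensorFlow."""
    try:
        import tensorflow as tf
    except ImportError:
        return None  # TensorFlow is not installed in this kernel's environment

    # Build info is dict-like; the CUDA/cuDNN keys are only present on GPU builds.
    build = tf.sysconfig.get_build_info()
    return {
        "tf_version": tf.__version__,
        "cuda_version": build.get("cuda_version"),
        "cudnn_version": build.get("cudnn_version"),
        "gpus": [gpu.name for gpu in tf.config.list_physical_devices("GPU")],
    }


report = tf_gpu_report()
print(report if report is not None else "TensorFlow is not importable in this kernel")
```

An empty `gpus` list usually means the job has no GPU allocated or the cuDNN/CUDA module was not loaded for the session.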

Containers - tensorflow/serving

You can use TensorFlow Serving to serve multiple versions of models, but the examples given use Ubuntu packages or Docker. YCRC clusters run RHEL and do not allow Docker, so we use Apptainer instead.

docker pull tensorflow/serving

export ML_PATH=$HOME/ml # or wherever this project is
docker run -it --rm -p 8500:8500 -p 8501:8501 \
   -v "$ML_PATH/my_mnist_model:/models/my_mnist_model" \
   -e MODEL_NAME=my_mnist_model \
   tensorflow/serving

apptainer build tf_model_server.sif docker://tensorflow/serving

export ML_PATH=$HOME/repos/handson-ml2
apptainer run --containall -B "$ML_PATH/my_mnist_model:/models/my_mnist_model" \
    --env MODEL_NAME=my_mnist_model tf_model_server.sif
  • You only need to rebuild the .sif file at the points where you would otherwise run docker pull
  • You don't need to forward ports out of the container
  • --containall makes apptainer run behave more like docker and allows finer control over which files the container sees
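With the server running, a client can send prediction requests to TensorFlow Serving's REST API on port 8501. Below is a minimal sketch using only the Python standard library. It assumes the client runs on the same node as the server (hence `localhost`) and that `my_mnist_model` accepts 28x28 images. The helper names `build_predict_request` and `predict` are illustrative and not part of any library.

```python
import json
import urllib.request


def build_predict_request(instances, model_name="my_mnist_model",
                          host="localhost", port=8501):
    """Build a POST request for the TensorFlow Serving REST predict endpoint."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({
        "signature_name": "serving_default",
        "instances": instances,
    }).encode("utf-8")
    return urllib.request.Request(
        url, data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(instances, **kwargs):
    """Send the request and return the 'predictions' field of the JSON reply."""
    request = build_predict_request(instances, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["predictions"]
```

For example, `predict([[[0.0] * 28 for _ in range(28)]])` would send one blank image, provided that input shape matches your model. If the server runs in a different job, replace `localhost` with that job's node name.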

Heterogeneous Job Layouts

Use a heterogeneous job if you want to allocate multiple node types in a single job, e.g. a larger first task/worker for a tf "chief" or a CPU-only parameter server.

#SBATCH --partition=gpu --time=1- --ntasks=1 --cpus-per-task=4
#SBATCH --mem-per-cpu=8G --gpus-per-task=1 --job-name=tf2-cluster-chief
#SBATCH hetjob
#SBATCH --ntasks=8 --cpus-per-task=2 
#SBATCH --mem-per-cpu=16G --gpus-per-task=1

module load miniconda cuDNN/
conda activate tf2

# this python process will start on a node in the gpu partition with 4 CPUs & 1 GPU

#SBATCH --partition=day --time=1- --ntasks=1 --cpus-per-task=2
#SBATCH --mem-per-cpu=8G --job-name=tf2-cluster-par-serv
#SBATCH hetjob
#SBATCH --partition=gpu --ntasks=8 --cpus-per-task=2 
#SBATCH --mem-per-cpu=16G --gpus-per-task=1

module load miniconda cuDNN/
conda activate tf2

# this python process will start on a node in the day partition with 2 CPUs
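Each process in a job like this needs to know the cluster layout, which TensorFlow's multi-worker strategies read from the `TF_CONFIG` environment variable. The following is a minimal sketch of building that value, assuming you have already collected the host names for each role (for example with `scontrol show hostnames`). The function name, the host names, and port 2222 are placeholders, not values taken from the scripts above, and using a single port assumes one task per node.

```python
import json


def make_tf_config(hosts_by_role, task_type, task_index, port=2222):
    """Return a TF_CONFIG JSON string for one task in the cluster."""
    if task_type not in hosts_by_role:
        raise ValueError(f"unknown task type: {task_type!r}")
    if not 0 <= task_index < len(hosts_by_role[task_type]):
        raise ValueError(f"task index {task_index} out of range for {task_type!r}")

    # Every role maps to a list of "host:port" addresses.
    cluster = {
        role: [f"{host}:{port}" for host in hosts]
        for role, hosts in hosts_by_role.items()
    }
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# Hypothetical layout: one chief, two workers, one CPU-only parameter server.
layout = {"chief": ["node01"], "worker": ["node02", "node03"], "ps": ["node04"]}
example = make_tf_config(layout, "worker", 1)
print(example)
```

Each process would set `os.environ["TF_CONFIG"]` to its own value before creating a distribution strategy; only the `task` entry differs between processes.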

Last update: September 9, 2022