Hugging Face
Hugging Face can be used on YCRC systems to run large language models through Python scripts or Jupyter notebooks. This approach provides explicit control over model loading, precision, and GPU placement, and is appropriate for scripted workflows and exploratory analysis.
Intended usage
Hugging Face is appropriate for:
- scripted inference workflows
- Jupyter notebook workflows
- workflows requiring explicit control over model loading and precision
- multi-GPU inference on a single node
- retrieval-augmented generation (RAG)
- fine-tuning
Ollama vs Hugging Face
The YCRC supports all AI/ML workflows. However, we generally recommend Ollama for beginners because it is simpler to use. Hugging Face provides significantly more flexibility and control over model development and is better suited to advanced users. Hugging Face also provides access to many more models than Ollama.
Ultimately, there is no wrong choice. If you run into any issues with your workflows, please contact research.computing@yale.edu.
Environment setup
Hugging Face must be installed inside a Miniconda environment. Environments must be created on a compute node.
salloc --partition=devel --cpus-per-task=2 --time=1:00:00 --mem=32G
module load miniconda
conda create --name huggingface python=3.11 transformers accelerate tokenizers datasets notebook
conda activate huggingface
PyTorch must be installed separately. Make sure to install a build with CUDA support.
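For example, with the huggingface environment activated, and assuming the CUDA 12.1 wheels match the cluster's GPU drivers (check pytorch.org for the current install command for other CUDA versions):

pip install torch --index-url https://download.pytorch.org/whl/cu121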
To make the environment available in Open OnDemand Jupyter:
module reset
ycrc_conda_env.sh update
Interactive usage
Interactive Hugging Face workflows are best run using the Jupyter Notebook application in Open OnDemand.
Scripts may also be executed directly from the command line within an interactive allocation.
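A sketch of a command-line run inside an interactive allocation; the GPU count and the script name my_script.py are placeholders to adapt:

salloc --partition=gpu_devel --gpus=1
module load miniconda
conda activate huggingface
python my_script.py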
GPU selection and constraints
By default, Slurm assigns the lowest-memory GPU available in the selected partition. Many LLMs require more VRAM than the smallest available GPU provides.
If a model requires more GPU memory, specify a GPU constraint explicitly.
Interactive example of requesting a specific GPU:
salloc --partition=gpu_devel --constraint="rtx5000|v100"
Batch script example of requesting a specific GPU:
#SBATCH --constraint="rtx5000|v100"
Model loading and precision
Automatic dtype selection during model loading is discouraged. Using automatic dtype selection can result in:
- inefficient memory usage
- unexpected precision choices
- failure to load on GPUs with limited VRAM
Recommended practice:
- explicitly specify model precision (fp16, int8, etc.)
- validate memory usage on the target GPU type
Precision selection should be guided by the GPU memory available and the model size. As a rough estimate, the weights alone occupy the parameter count times the bytes per parameter (about 14 GB for a 7B-parameter model in fp16), before accounting for activations and the KV cache.
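A minimal sketch of loading a model at an explicit precision, using the huggingface environment above; "model-name" is a placeholder for whichever model you intend to run:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "model-name"  # placeholder: substitute the model you intend to run

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the weights in fp16 explicitly rather than relying on torch_dtype="auto".
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
).to("cuda")

print(next(model.parameters()).dtype)  # should print torch.float16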
Multi-GPU usage on a single node
Requesting multiple GPUs does not automatically distribute a model across devices. Multi-GPU usage requires explicit configuration in user code.
When using multiple GPUs on a single node:
- ensure all GPUs are allocated in the Slurm request
- configure the framework to shard or distribute the model, as sketched below
- validate that all GPUs are being used
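A minimal single-node sharding sketch, relying on the accelerate package installed above; device_map="auto" distributes the model's layers across all GPUs visible to the job, and "model-name" is a placeholder:

import torch
from transformers import AutoModelForCausalLM

print(torch.cuda.device_count())  # should match the number of GPUs requested from Slurm

model = AutoModelForCausalLM.from_pretrained(
    "model-name",                 # placeholder model identifier
    torch_dtype=torch.float16,
    device_map="auto",            # accelerate shards the model across visible GPUs
)

print(model.hf_device_map)        # shows which device each module was placed on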
Please see the YCRC documentation on multinode jobs for instructions.
After execution, confirm GPU memory usage and utilization using Jobstats.
Note: Jobs that fail to use all requested GPUs, as detected by Jobstats, will be cancelled to ensure resource availability for other workflows.
Batch usage
Once a workflow is validated interactively, it can be converted to a batch job.
For notebook-based workflows, Jupyter notebooks can be executed non-interactively using papermill. Guidance is available under Command-Line Execution of Jupyter Notebooks.
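For example, with papermill installed in the environment (pip install papermill), a validated notebook can be executed and its executed copy saved; my_notebook.ipynb and output.ipynb are placeholder names:

papermill my_notebook.ipynb output.ipynb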
For script-based workflows, submit the script directly using a Slurm batch script.
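A minimal batch script sketch; the resource values and the script name run_inference.py are assumptions to adapt to your model and partition:

#!/bin/bash
#SBATCH --job-name=hf-inference
#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --constraint="rtx5000|v100"
#SBATCH --cpus-per-task=2
#SBATCH --mem=32G
#SBATCH --time=1:00:00

module load miniconda
conda activate huggingface
python run_inference.py

Submit the script with sbatch.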
The YCRC does not provide complete example Hugging Face workflows, as they vary with the model a user chooses.
Validation
After running a Hugging Face job:
- confirm the job requested the expected GPU resources
- inspect GPU memory usage and utilization using Jobstats
- confirm the number of active GPUs matches the request
- verify that the model loaded at the intended precision
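The last two checks can also be run in-session; this sketch assumes model is the Hugging Face model loaded earlier:

import torch

print(torch.cuda.device_count())           # GPUs visible to the job
print(next(model.parameters()).dtype)      # precision the model actually loaded at
for i in range(torch.cuda.device_count()):
    print(torch.cuda.memory_allocated(i))  # bytes currently allocated on each GPU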