Available Resources and Recommendations

YCRC research computing infrastructure can be used to run localized large language models on GPU-enabled clusters.

Running LLMs locally provides the following advantages:

  • user data remains local to the cluster, avoiding the security concerns that come with non-local models
  • researchers can control model versions and configurations
  • YCRC GPUs are free of charge (except on the secure data cluster, Hopper)
  • models requiring up to 1100 GB of aggregate GPU memory can be run on a single node (Bouchet only)

GPU availability on YCRC resources

Once an LLM workflow is configured, models can be run on any GPU partition:

  • McCleary and Grace: gpu_devel, gpu, gpu_scavenge
  • Milgram: gpu, scavenge
  • Bouchet: gpu, gpu_h200, gpu_devel
  • Hopper: gpu

GPU memory capacity varies significantly by GPU type. Some models will not run unless sufficient vRAM is available.

All YCRC GPU nodes contain four GPUs, except H200 nodes, which contain eight. When a job requests four GPUs on the same node, the available GPU memory is the sum of all four devices.

For example:

  • requesting four A100-80GB GPUs provides 320 GB of aggregate GPU memory
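The aggregate memory visible to a job can be confirmed from inside the allocation. Below is a minimal sketch that assumes PyTorch is installed in the job's environment; it lists each GPU assigned by the scheduler and sums their memory.

    import torch  # assumes a CUDA-enabled PyTorch install in the job environment

    total_bytes = 0
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        total_bytes += props.total_memory
        print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")
    print(f"Aggregate GPU memory: {total_bytes / 2**30:.0f} GiB")

Run inside a job that requested four A100-80GB GPUs, this reports roughly 320 GiB in aggregate.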

Cluster GPU summary

Cluster  | Largest GPU | Max vRAM (single node) | # of Largest GPUs available | # of Other GPUs available | Workflow Recommendations
Bouchet  | H200        | 1120 GB                | 80                          | 40                        | Very large models, multi-GPU inference, large-scale experimentation
Hopper   | H200        | 1120 GB                | 32                          | 172                       | HIPAA, PHI, and PII data analysis
Grace    | A100-80GB   | 320 GB                 | 16                          | 132                       | Large inference workloads, memory-bound models
McCleary | A100-80GB   | 320 GB                 | 12                          | 92                        | General-purpose inference, development, testing
Milgram  | H100        | 320 GB                 | 12                          | 8                         | Medium-risk data workflows

Detailed hardware information is available on the cluster pages. Navigate to the Public Partitions section of each cluster page and select gpu or gpu_devel to view available hardware.

Scheduling considerations

YCRC operates on a queued scheduling system. Larger GPUs are in higher demand and typically have longer wait times.

Recommendations:

  • begin development using smaller models that fit on lower-memory GPUs
  • validate workflows interactively
  • scale model size only after confirming correctness

This approach reduces queue time and avoids failed jobs.
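For example, a new workflow can be validated interactively on a gpu_devel node with a small model before moving to larger GPUs. The sketch below assumes the Hugging Face transformers library (with torch and accelerate) is installed; the model identifier is only an example of a small model that fits on any YCRC GPU.

    from transformers import pipeline  # assumes transformers, torch, and accelerate are installed

    # Example small model; substitute the model your workflow will eventually use.
    generator = pipeline(
        "text-generation",
        model="Qwen/Qwen2.5-1.5B-Instruct",
        device_map="auto",  # place the model on whatever GPU the job was given
    )

    # A short prompt is enough to confirm the environment, GPU, and model all work.
    output = generator("Explain in one sentence why GPU memory matters for LLMs.",
                       max_new_tokens=64)
    print(output[0]["generated_text"])

Once this runs cleanly, the same script can be resubmitted with a larger model on a bigger GPU.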

Choosing a local LLM

Models are available through sources such as Hugging Face and Ollama.
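For Hugging Face models, the weights are usually downloaded to the cluster before the job runs. A minimal sketch, assuming the huggingface_hub package is installed; the repository id is only an example, and gated models additionally require an access token.

    from huggingface_hub import snapshot_download  # assumes huggingface_hub is installed

    # Download every file in an example model repository into the local cache
    # and report where the files ended up. Replace repo_id with your model.
    path = snapshot_download(repo_id="Qwen/Qwen2.5-1.5B-Instruct")
    print(f"Model files stored under: {path}")

Downloading ahead of time avoids repeating large transfers inside every batch job.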

Model names commonly include parameter counts:

  • 7B = 7 billion parameters
  • 13B, 30B, 70B, etc.

Approximate GPU memory requirements for inference without quantization:

Parameter size | Inference vRAM (GB)
7B             | 10+
13B            | 20+
30B            | 40+
70B            | 80+
305B           | 400+
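These figures reflect a common back-of-the-envelope rule: at 16-bit precision the weights alone occupy about 2 bytes per parameter, and the KV cache and runtime add overhead on top. The sketch below encodes that arithmetic; the 2-byte figure and the overhead factor are assumptions, and real usage varies with precision and context length.

    def estimate_inference_vram_gb(params_billions: float,
                                   bytes_per_param: float = 2.0,  # fp16/bf16 weights
                                   overhead: float = 1.2) -> float:
        """Rough estimate: weight memory plus ~20% for KV cache and activations."""
        return params_billions * bytes_per_param * overhead

    for size in (7, 13, 30, 70):
        print(f"{size}B parameters: ~{estimate_inference_vram_gb(size):.0f} GB")

For a 7B model this gives roughly 17 GB, consistent with the 10+ GB lower bound in the table.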

Exact requirements are listed on the model's Hugging Face or Ollama page.

Domain-specific models are often smaller and more efficient than general-purpose models and should be preferred when available.

If vRAM requirements are uncertain, run the model on a larger GPU and inspect actual usage with Jobstats.
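Peak usage can also be checked from inside a running job. The sketch below assumes a PyTorch-based workload; for other runtimes, nvidia-smi on the compute node or Jobstats after the job report the same information.

    import torch  # assumes the workload itself runs under PyTorch

    # ... load the model and run a few representative prompts here ...

    # Report the peak memory actually allocated on each visible GPU, in GiB.
    for i in range(torch.cuda.device_count()):
        peak_gib = torch.cuda.max_memory_allocated(i) / 2**30
        print(f"GPU {i}: peak allocated {peak_gib:.1f} GiB")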

Advanced methods (overview only)

More complex workflows require additional GPU memory:

Method                | vRAM Required
Inference             | Size of model
RAG                   | 10-30+ GB more than inference
Fine-tuning           | 3-5x inference
Fine-tuning (QLoRA)   | 10-20% of inference
Training from scratch | Not Recommended
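The multipliers above can be combined with the inference estimates from the previous section. A sketch of that arithmetic follows; the specific midpoints chosen from each range are assumptions for illustration only.

    # Rules of thumb from the table above, expressed relative to inference vRAM.
    def estimate_method_vram_gb(method: str, inference_gb: float) -> float:
        if method == "inference":
            return inference_gb                # size of the model
        if method == "rag":
            return inference_gb + 20           # "10-30+ GB more than inference"
        if method == "fine-tuning":
            return inference_gb * 4            # "3-5x inference"
        if method == "qlora":
            return inference_gb * 0.15         # "10-20% of inference"
        raise ValueError(f"no rule of thumb for {method!r}")

    # Example: a 70B-class model whose inference footprint is roughly 80 GB.
    for method in ("inference", "rag", "fine-tuning", "qlora"):
        print(f"{method}: ~{estimate_method_vram_gb(method, 80):.0f} GB")

For the 80 GB example this prints roughly 80, 100, 320, and 12 GB respectively.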

Last update: January 30, 2026