Available Resources and Recommendations
YCRC research computing infrastructure can be used to run local large language models (LLMs) on GPU-enabled clusters.
Running LLMs locally provides the following advantages:
- user data remains local to the cluster, avoiding the security concerns that come with sending data to non-local (hosted) models
- researchers can control model versions and configurations
- YCRC GPUs are free of charge (except on the secure data cluster, Hopper)
- models requiring up to 1100 GB of aggregate GPU memory can be run on a single node (Bouchet only)
GPU availability on YCRC resources
Once an LLM workflow is configured, models can be run on any GPU partition:
- McCleary and Grace: gpu_devel, gpu, gpu_scavenge
- Milgram: gpu, scavenge
- Bouchet: gpu, gpu_h200, gpu_devel
- Hopper: gpu
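If you want to see what hardware a partition offers before submitting, a minimal sketch is shown below. It assumes you are logged in to a cluster login node so that Slurm's sinfo command is available, and gpu_devel is only an example partition name:

```python
import subprocess

# Show each node in the partition with its generic resources (GRES,
# i.e. the GPU type and count) and current state.
partition = "gpu_devel"  # substitute any partition listed above
result = subprocess.run(
    ["sinfo", "--partition", partition, "--Node",
     "--format", "%N %G %T"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```

The %G column reports the GRES string, which lists the GPU type and count on each node.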
GPU memory capacity varies significantly by GPU type. Some models will not run unless sufficient vRAM is available.
All YCRC GPU nodes contain four GPUs, except the H200 nodes, which contain eight. When a job requests four GPUs on the same node, the available GPU memory is the sum of all four devices.
For example:
- requesting four A100-80GB GPUs provides 320 GB of aggregate GPU memory
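To confirm the aggregate vRAM actually visible to a running job, here is a minimal sketch, assuming PyTorch is installed in the job's environment and the GPUs have already been allocated by Slurm:

```python
import torch

# Sum the vRAM of every GPU visible to this job (Slurm sets
# CUDA_VISIBLE_DEVICES to the allocated devices).
total_bytes = sum(
    torch.cuda.get_device_properties(i).total_memory
    for i in range(torch.cuda.device_count())
)
print(f"{torch.cuda.device_count()} GPUs, "
      f"{total_bytes / 1024**3:.0f} GB aggregate vRAM")
```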
Cluster GPU summary
| Cluster | Largest GPU | Max vRAM per node | # of Largest GPUs available | # of Other GPUs available | Workflow Recommendations |
|---|---|---|---|---|---|
| Bouchet | H200 | 1120 GB | 80 | 40 | Very large models, multi-GPU inference, large-scale experimentation |
| Hopper | H200 | 1120 GB | 32 | 172 | HIPAA, PHI, and PII data analysis |
| Grace | A100-80GB | 320 GB | 16 | 132 | Large inference workloads, memory-bound models |
| McCleary | A100-80GB | 320 GB | 12 | 92 | General-purpose inference, development, testing |
| Milgram | H100 | 320 GB | 12 | 8 | Medium-risk data workflows |
Detailed hardware information is available on the individual cluster pages. Navigate to the Public Partitions section and select gpu or gpu_devel to view the available hardware.
Scheduling considerations
YCRC operates on a queued scheduling system. Larger GPUs are in higher demand and typically have longer wait times.
Recommendations:
- begin development using smaller models that fit on lower-memory GPUs
- validate workflows interactively
- scale model size only after confirming correctness
This approach reduces queue time and avoids failed jobs.
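As an illustration of validating a workflow interactively before scaling up, the sketch below loads a deliberately small model. It assumes the Hugging Face transformers library is installed in your environment and uses gpt2 purely as a stand-in for whatever small model you are testing:

```python
from transformers import pipeline

# Load a small model on the first allocated GPU to confirm that the
# environment, drivers, and generation loop all work end to end.
generator = pipeline("text-generation", model="gpt2", device=0)
print(generator("The quick brown fox", max_new_tokens=20)[0]["generated_text"])
```

Once this runs cleanly, the same code can be pointed at a larger model together with a correspondingly larger GPU request.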
Choosing a local LLM
Models are available through public repositories such as Hugging Face and Ollama.
Model names commonly include a parameter count: 7B means 7 billion parameters, and likewise 13B, 30B, 70B, and so on.
Approximate GPU memory requirements for inference without quantization:
| Parameter size | Inference vRAM (GB) |
|---|---|
| 7B | 10+ |
| 13B | 20+ |
| 30B | 40+ |
| 70B | 80+ |
| 305B | 400+ |
Exact requirements are listed on the model's Hugging Face or Ollama page.
Domain-specific models are often smaller and more efficient than general-purpose models and should be preferred when available.
If vRAM requirements are uncertain, run the model on a larger GPU and inspect actual usage with Jobstats.
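As a rough cross-check of the table above, a common rule of thumb is about 2 bytes per parameter for unquantized (FP16/BF16) weights plus extra room for the KV cache and activations. The sketch below encodes that arithmetic; the 20% overhead factor is an assumption, not a measured value:

```python
def estimate_inference_vram_gb(params_billion: float,
                               bytes_per_param: float = 2.0,
                               overhead: float = 0.2) -> float:
    """Rule-of-thumb vRAM estimate for unquantized inference.

    params_billion: model size in billions of parameters (e.g. 7 for a 7B model)
    bytes_per_param: 2.0 for FP16/BF16 weights
    overhead: extra fraction for KV cache, activations, and runtime buffers
    """
    weights_gb = params_billion * bytes_per_param  # 1B params * 2 bytes ~= 2 GB
    return weights_gb * (1.0 + overhead)

# e.g. a 70B model: roughly 70 * 2 * 1.2 = 168 GB, so it needs multiple GPUs
print(f"{estimate_inference_vram_gb(70):.0f} GB")
```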
Advanced methods (overview only)
More complex workflows require additional GPU memory:
| Method | vRAM Required |
|---|---|
| Inference | Size of model |
| RAG | Inference + an additional 10-30+ GB |
| Fine-tuning | 3-5x inference |
| Fine-tuning (QLoRA) | 10-20% of inference |
| Training from scratch | Not Recommended |