Ruddle
Ruddle is intended for use only on projects related to the Yale Center for Genome Analysis; please do not use this cluster for other projects. If you have any questions about this policy, please contact us.
Ruddle is named for Frank Ruddle, a Yale geneticist who was a pioneer in genetic engineering and the study of developmental genetics.
System Status and Monitoring
For system status messages and the schedule for upcoming maintenance, please see the system status page. For a current node-level view of job activity, see the cluster monitor page (VPN only).
Partitions and Hardware
Ruddle is made up of several kinds of compute nodes. We group them into (sometimes overlapping) Slurm partitions meant to serve different purposes. By combining the --partition and --constraint Slurm options you can more finely control what nodes your jobs can run on.
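For example, to steer a batch job onto the oldest Haswell nodes you might combine the two options like this (a minimal sketch; batch.sh stands in for your own batch script, and the feature name comes from the node tables below):
$ sbatch --partition=general --constraint=oldest batch.sh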
Job Submission Rate Limits
Job submissions are limited to 200 jobs per hour. See the Rate Limits section in the Common Job Failures page for more info.
Public Partitions
See each tab below for more information about the available common use partitions.
Use the general partition for most batch jobs. This is the default if you don't specify one with --partition.
Request Defaults
Unless otherwise specified, your jobs will run with the following srun and sbatch options for this partition.
--time=7-00:00:00 --nodes=1 --ntasks=1 --cpus-per-task=1 --mem-per-cpu=5120
Job Limits
Jobs submitted to the general partition are subject to the following limits:
Limit | Value |
---|---|
Max job time limit | 30-00:00:00 |
Maximum CPUs per user | 300 |
Maximum memory per user | 1800G |
Available Compute Nodes
Requests for --cpus-per-task and --mem can't exceed what is available on a single compute node.
Nodes | CPU Type | CPUs/Node | Memory/Node (GiB) | Node Features |
---|---|---|---|---|
154 | E5-2660_v3 | 20 | 119 | haswell, avx2, E5-2660_v3, oldest |
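As a sketch, a batch script header for the general partition might begin like this; the resource values are illustrative only and must stay within the defaults and limits above:
#!/bin/bash
#SBATCH --partition=general
#SBATCH --time=2-00:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=5120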
Use the interactive partition for jobs that require ongoing interaction, for example exploratory analyses or debugging compilations.
Request Defaults
Unless otherwise specified, your jobs will run with the following srun and sbatch options for this partition.
--time=1-00:00:00 --nodes=1 --ntasks=1 --cpus-per-task=1 --mem-per-cpu=5120
Job Limits
Jobs submitted to the interactive partition are subject to the following limits:
Limit | Value |
---|---|
Max job time limit | 2-00:00:00 |
Maximum CPUs per user | 20 |
Maximum memory per user | 256G |
Available Compute Nodes
Requests for --cpus-per-task and --mem can't exceed what is available on a single compute node.
Nodes | CPU Type | CPUs/Node | Memory/Node (GiB) | Node Features |
---|---|---|---|---|
154 | E5-2660_v3 | 20 | 119 | haswell, avx2, E5-2660_v3, oldest |
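One common way to get a shell on this partition is an interactive srun session (a minimal sketch; adjust the resources to your needs and the limits above):
$ srun --pty -p interactive --cpus-per-task=2 --mem=8G bash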
Use the bigmem partition for jobs that have memory requirements other partitions can't handle.
Request Defaults
Unless otherwise specified, your jobs will run with the following srun and sbatch options for this partition.
--time=1-00:00:00 --nodes=1 --ntasks=1 --cpus-per-task=1 --mem-per-cpu=5120
Job Limits
Jobs submitted to the bigmem partition are subject to the following limits:
Limit | Value |
---|---|
Max job time limit | 7-00:00:00 |
Maximum CPUs per user | 32 |
Maximum memory per user | 1505G |
Available Compute Nodes
Requests for --cpus-per-task and --mem can't exceed what is available on a single compute node.
Nodes | CPU Type | CPUs/Node | Memory/Node (GiB) | Node Features |
---|---|---|---|---|
2 | E7-4809_v3 | 32 | 1505 | haswell, avx2, E7-4809_v3 |
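For instance, a hypothetical job needing far more memory than the general nodes provide could request it explicitly (illustrative values that fit within the limits above; bigjob.sh is a placeholder for your own script):
$ sbatch --partition=bigmem --cpus-per-task=16 --mem=1000G bigjob.sh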
Use the scavenge partition to run preemptable jobs on more resources than normally allowed. For more information about scavenge, see the Scavenge documentation.
Request Defaults
Unless otherwise specified, your jobs will run with the following srun and sbatch options for this partition.
--time=1-00:00:00 --nodes=1 --ntasks=1 --cpus-per-task=1 --mem-per-cpu=5120
Job Limits
Jobs submitted to the scavenge partition are subject to the following limits:
Limit | Value |
---|---|
Max job time limit | 7-00:00:00 |
Maximum CPUs per user | 300 |
Maximum memory per user | 1800G |
Available Compute Nodes
Requests for --cpus-per-task and --mem can't exceed what is available on a single compute node.
Nodes | CPU Type | CPUs/Node | Memory/Node (GiB) | Node Features |
---|---|---|---|---|
2 | E7-4809_v3 | 32 | 1505 | haswell, avx2, E7-4809_v3 |
154 | E5-2660_v3 | 20 | 119 | haswell, avx2, E5-2660_v3, oldest |
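Because scavenge jobs can be preempted, one hedged pattern is to ask Slurm to requeue a preempted job automatically (--requeue is a standard sbatch option; see the Scavenge documentation for the recommended workflow):
$ sbatch --partition=scavenge --requeue batch.sh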
YCGA Data Retention Policy
Illumina sequence data is initially written to YCGA's main storage system, which is located in the main HPC datacenter at Yale's West Campus. Data stored there is protected against loss by software RAID. Raw basecall data (bcl files) is immediately transformed into DNA sequences (fastq files).
- 45 days after sequencing, the raw bcl files are deleted.
- 60 days after sequencing, the fastq files are written to a tape archive. Two tape libraries store identical copies of the data, located in two datacenters in separate buildings on West Campus.
- 365 days after sequencing, all data is deleted from main storage. Users continue to have access to the data via the tape archive. Data is retained on the tape archive indefinitely. Instructions for retrieving archived data.
All compression of sequence data is lossless. Gzip is used for data stored on the main storage, and quip is used for data stored on the tape archive. Disaster recovery is provided by the data stored on the tape library.
Access Sequencing Data
To avoid duplication of data and to save space that counts against your quotas, we suggest that you make soft links to your sequencing data rather than copying them.
Normally, YCGA will send you an email informing you that your data is ready, and it will include a URL that looks like: http://fcb.ycga.yale.edu:3010/randomstring/sample_dir_001
You can use that link to download your data in a browser, but if you plan to process the data on Ruddle, it is better to make a soft link to the data, rather than copying it. You can use the ycgaFastq tool to do that:
$ /home/bioinfo/software/knightlab/bin_Mar2018/ycgaFastq fcb.ycga.yale.edu:3010/randomstring/sample_dir_001
ycgaFastq can also be used to retrieve data that has been archived to tape. The simplest way to do that is to provide the sample submitter's netid and the flowcell (run) name:
$ ycgaFastq rdb9 AHFH66DSXX
If you have a path to the original location of the sequencing data, ycgaFastq can retrieve the data using that, even if the run has been archived and deleted:
$ ycgaFastq /ycga-gpfs/sequencers/illumina/sequencerD/runs/190607_A00124_0104_AHLF3MMSXX/Data/Intensities/BaseCalls/Unaligned-2/Project_Lz438
ycgaFastq can be used in a variety of other ways to retrieve data. For more information, see the documentation or contact us.
If you would like to know the true location of the data on Ruddle, do this:
$ cd /ycga-gpfs/project/fas/lsprog/tools/external/data/randomstring/sample_dir_001
$ ls -l
Tip
Original sequence data are archived pursuant to the YCGA retention policy. For long-running projects we recommend you keep a personal backup of your sequence files. If you need to retrieve archived sequencing data, please see our guide on how to do so.
If you have a very old link from YCGA that doesn't use the random string, you can find the location by decoding the url as shown below:
fullPath Starts With | Root Path on Ruddle |
---|---|
gpfs_illumina/sequencer | /gpfs/ycga/illumina/sequencer |
ba_sequencers | /ycga-ba/ba_sequencers |
sequencers | /gpfs/ycga/sequencers/panfs/sequencers |
For example, if the sample link you received is:
http://sysg1.cs.yale.edu:2011/gen?fullPath=sequencers2/sequencerV/runs/131107_D00306_0096... etc
The path on the cluster to the data is:
/gpfs/ycga/sequencers/panfs/sequencers2/sequencerV/runs/131107_D00306_0096... etc
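As a rough illustration of that translation in the shell (assuming the prefix table above; the trailing path is truncated here just as in the example):
$ url='http://sysg1.cs.yale.edu:2011/gen?fullPath=sequencers2/sequencerV/runs/131107_D00306_0096...'
$ fullPath="${url#*fullPath=}"
$ echo "/gpfs/ycga/sequencers/panfs/${fullPath}"
/gpfs/ycga/sequencers/panfs/sequencers2/sequencerV/runs/131107_D00306_0096...
For links beginning with one of the other prefixes in the table, substitute that row's root path instead of /gpfs/ycga/sequencers/panfs/.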
Public Datasets
We host datasets of general interest in a loosely organized directory tree in /gpfs/ycga/datasets:
├── annovar
│ └── humandb
├── db
│ └── blast
├── genomes
│ ├── Aedes_aegypti
│ ├── Bos_taurus
│ ├── Chelonoidis_nigra
│ ├── Danio_rerio
│ ├── Gallus_gallus
│ ├── hisat2
│ ├── Homo_sapiens
│ ├── Macaca_mulatta
│ ├── Monodelphis_domestica
│ ├── Mus_musculus
│ ├── PhiX
│ └── tmp
└── hisat2
└── mouse
If you would like us to host a dataset or have questions about what is currently available, please contact us.
Storage
Ruddle's filesystem, /gpfs/ycga, is where home, project, and scratch60 directories are located. For more details on the different storage spaces, see our Cluster Storage documentation. Ruddle's old ycga-ba filesystem has been retired.
You can check your current storage usage and limits by running the getquota command. Your ~/project and ~/scratch60 directories are shortcuts. Get a list of the absolute paths to your directories with the mydirectories command. If you want to share data in your Project or Scratch directory, see the permissions page.
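For example, to review your usage and list the absolute paths behind the shortcuts:
$ getquota
$ mydirectories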
Warning
Files stored in scratch60 are purged if they are older than 60 days. You will receive an email alert one week before they are deleted.
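One way to spot files nearing the purge window is a standard find by modification time (a hedged sketch; the purge itself is managed by the system, not by this command):
$ find ~/scratch60 -type f -mtime +50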
Partition | Root Directory | Storage | File Count | Backups |
---|---|---|---|---|
home | /gpfs/ycga/home | 125GiB/user | 500,000 | Yes |
project | /gpfs/ycga/project | 1TiB/group, increase to 4TiB on request | 5,000,000 | No |
scratch60 | /gpfs/ycga/scratch60 | 20TiB/group | 15,000,000 | No |