We are trying to apply our in-house hail sample qc script to the WES data on the UKB RAP (hail_sample_qc script attached below).
hail_sample_qc.txt (12.4 KB)
We are beginning to test its use on the UK Biobank’s cloud service by running it on only one chromosome pVCF (starting small with chr21 and chrY). In this case, the code we are running is:
python hail_sample_qc.py file:///opt/notebooks/LCR-hg38-noHLA.interval_list 'test' file:///opt/notebooks/with_chr_noMT_dbsnp144.b38.vcf.gz 'file:///mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_cY_b*_v1.vcf.gz' --coding-intervals file:///opt/notebooks/xgen_plus_spikein.GRCh38.bed
where the LCR file specifies low-complexity regions, and the with_chr_noMT_dbsnp144.b38.vcf.gz file is our reference VCF.
This script has been used successfully in the past in our institute’s computing cluster on locally stored VCF files.
As shown below, hail seems to initiate properly and even successfully completes several initial steps. However, at the start “Stage 3”, the script seems to stall and seemingly gets stuck. I have also attached the log files associated with two attempts at running the script where the script seems to stall.
hail.log (2.6 MB)
hail2.log (1.3 MB)
For reference, I am working on a Spark JupyterLab cluster with 96 cores, 187.5GB of total memory, and 9600GB of total storage. Each chromosome contains the WES data of roughly 470,000 participants.
Does anyone have suggestions for what may be causing this stalling? I appreciate the help.
It’s not really stalling, it’s using a single core to do a lot of work.
Hail is sensitive to the amount of parallelism. You want to use as much parallelism as possible, always. Using operations that load all the data onto a single machine, like
to_pandas, or operations that have to produce a single file, like
parallel=, reduce parallelism.
I recommend a few changes:
write to a Hail MatrixTable after importing and use that Hail MatrixTable file for everything else. MatrixTable format is a binary format. It’s more efficient for analysis than VCF.
- Instead of using
to_pandas directly from a pipeline,
write whatever you want to convert to pandas first then
read the result and
Let me know if you still see slow single-partition tasks after making these changes.
Also, if you have 470,000 WES you probably need a lot more than 96 cores! I strongly recommend using spot instances with an autoscaling Spark cluster. Spot instances cost ~1/3 of a normal instance and Hail is designed to work with this kind of instance. That means you can use 3x as many cores for the same price! I also recommend using any disk space. You don’t need a lot of disk space. Hail is designed to work with Amazon S3 directly. Make sure you
hl.init(tmpdir='s3://...') so that you use S3 instead of HDFS. These changes should let you use a lot more parallelism for the same cost.
I am re-posting this reply with some updates.
Thank you for the suggestions you provided us. We implemented your recommendations into our script
hail_sample_qc_YL_MG_072523.txt (13.8 KB) and found that it significantly improved our speed.
However, even with these changes, our projected time to process the entire exome of all 470k participants is impractically long. As a trial run, we tested the updated script on the singular ChrY block (for context, this singular ChrY block is 6GB, and the average block for one of the autosomal chromosomes is about ~23GB). It took about 40 minutes to export the QC stats for the LCR-pruned variants, and another 40 minutes to export the QC stats for the LCR-pruned, coding variants. The script ultimately ran for over 2 hours calculating the identity by descent estimates before I cancelled the job. UPDATE: We previously considered performing an identity by descent analysis, but we have since decided to drop this component of this Sample QC step.
We are still using the same JupyterLab Spark cluster environment settings (96 cores, 187.5GB of total memory, and 9600GB of total storage). I have asked the DNAnexus bioinformatics team if there is an autoscaling Spark cluster option, but have not heard back. (UPDATE: There is no autoscaling feature, and all core and node parameters must be chosen at the initiation of the JupyterLab Spark cluster environment).
UPDATE: They have suggested creating environment settings that use significantly less cores per node (on the order of 8 or 16 cores per node instead of the 96 I was previously using on my tests). Given that the primary functions we are now running are hl.sample_qc (once on the LCR-filtered variants, and once on the LCR-filtered, coding region variants), and hl.variant_qc, we were wondering if you had suggestions for the amount of nodes and cores that would be efficient at processing the WES of the nearly 500k samples.
Some frames of reference are provided for the hl.sample_qc function here but I think you would know best if the time estimates provided here scale linearly by increasing the number of participants 5-fold, and the amount of genomic data (in terms of GB of data stored under the pVCF) by roughly 7.3-fold.
We greatly appreciate whatever suggestions you can provide.