Performing SampleQC using Hail on ~500k WES samples


I am wondering if you have any suggestions for how to proceed with using Hail to perform sample QC on roughly 500k WES samples from the UK Biobank database.

The database contains WES data in PLINK format (fractioned by chr), BGEN format (fractioned by chr), and VCFs (fractioned both by individual participants, and fractioned by chr region).

My colleague has suggested generating individual MTs for each of the individual participant VCFs, and then performing a union of all of the individual MTs, but I am unclear if there would be a more efficient method of aggregating the data for sample QC and then variant QC.

Do you have any suggestions for how to utilize hail efficiently at this scale? Any help would be appreciated. Thank you!

Hey @mgarcia !

A couple things:

  • No matter what, this will be a very expensive analysis. 500k samples is a huge data set. Quality control is an iterative and exploratory process, and Hail is complex software; you’ll probably make mistakes as you learn to use it.
  • Where/how did you get a VCF partitioned by sample? How that VCF was produced will matter a lot to how best to work with it.
  • We use the term “partition(ed)” to refer to splitting a dataset into multiple files.
  • You have to use a Spark cluster with lots of cores to do this effectively. Unless your institution can provide such a thing, you’ll need to use the Cloud. We recommend Google Cloud and Google Cloud Dataproc (Google’s Spark product).

Frankly, single-sample VCFs sound like a nightmare to me. Unless those are single-sample “GVCFs” (these are produced by variant-calling software like GATK or DRAGEN and contain things called “reference blocks”), you should probably try to use some format that hasn’t been split by sample.

Are you sure you have WES in BGEN format? BGEN generally doesn’t have a way to represent sequencing information like depth, allele depth, and phred-scaled likelihoods.

Hi @danking,

Thank you for your response. To answer your first question, I have attached this excerpt from the UK Biobank that references the first 200k exomes they released (same protocol for the full 500k):

Primary and secondary analysis for the UKB 200k release was performed with an updated Functional Equivalence (FE) protocol that retains original quality scores in the CRAM files (referred to as the OQFE protocol). The OQFE protocol aligns and duplicate-marks all raw sequencing data (FASTQs) to the full GRCh38 reference in an alt-aware manner as described in the original FE manuscript (Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects - PubMed). The OQFE CRAMs were then called for small variants with DeepVariant to generate per-sample gVCFs. These gVCFs were aggregated and joint-genotyped with GLnexus to create a single multi-sample VCF (pVCF) for all UKB 200k samples. PLINK files were derived directly from this pVCF. Please note: to ensure that the UKB 200k data supports a broad range of analyses, no variant- or sample-level filters were pre-applied to the pVCF or PLINK files. The publicly released pVCF is the direct output of GLnexus, from which the PLINK files are generated. The pVCF contains allele-read depths and genotype qualities for all genotypes, from which variant- and sample-level QC metrics can be calculated and to which analysis-specific filters can be applied. Examples of such filtering are described in the UKB 200K preprint.

This might be a novice question, but why exactly is it preferable to use a GVCF in these large-scale analyses over a VCF? I understand that a GVCF keeps records of all sites, whether or not there exists a variant call, but how exactly is that advantageous?

The UKBB uses AWS with an integrated Spark cluster in their JupyterLab tool.

According to the UKBB’s documentation, the BGEN files were derived from the PLINK files (derived from the pVCF):
“plink2 --bfile pvcf.norm --export bgen-1.2 bits=8 ref-first --out pvcf.norm_zlib”

I greatly appreciate your help. I am very new to bioinformatics as a whole, and this is incredibly useful.

OK, I understand your situation. The answer will take a bit of writing to properly describe, and I’m a bit busy right now. Hopefully I’ll have time tomorrow to type it all out.

A post was split to a new topic: How do I use hl.import_vcf to import a VCF that has been partitioned into multiple files?

OK, I’ve been meaning to write most of this out anyway, so this is a good chance to do that. I start with some background and definitions before giving you practical advice.

We view sequencing as a series of transformations of data:

DNA ==sequencer=> FASTQ ==alignment=> BAM/CRAM ==variant caller=> GVCF ==joint caller=> PVCF

The FASTQ, BAM/CRAM, and GVCF datatypes all contain one sample per file. A PVCF contains zero or more samples per file.

The conversion from BAM/CRAM to GVCF is lossy. The reads are replaced by a set of statistics at each locus. For example, every locus will have a depth (DP) representing the number of reads that overlapped that locus. Every locus will also have an allele depth (AD) representing, per allele, the number of reads that were evidence for that allele at that locus. For example, at a heterozygous locus, you might have a depth of 30, a reference allele depth of 14, and an alternate allele depth of 15. Notice that there was one so-called “uninformative” read.
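To make the DP/AD arithmetic concrete, here is a small sketch in plain Python (not Hail) that checks the total depth against the per-allele depths and computes the allele balance, using the numbers from the example above:

```python
# Per-sample statistics at one heterozygous locus (values from the example above).
DP = 30           # total read depth at the locus
AD = [14, 15]     # reads supporting the reference and alternate allele

# Reads counted in DP but not assigned to any allele are "uninformative".
uninformative = DP - sum(AD)

# Allele balance: fraction of informative reads supporting the alternate allele.
# A heterozygous call is usually expected to have an allele balance near 0.5.
allele_balance = AD[1] / sum(AD)

print(uninformative)             # 1
print(round(allele_balance, 3))  # 0.517
```

Metrics like allele balance, derived from exactly these fields, are a common ingredient in genotype-level QC.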

Moreover, the GVCF format is sparse. Instead of representing every locus with one line of the file, intervals of homozygous reference loci are represented as a single line. These intervals are called “reference blocks”. Reference blocks have no alternate alleles, so there is no allele depth (AD) or phred-scaled likelihood (PL). Instead there is usually an overall depth (DP) and a genotype quality (GQ). The latter indicates our confidence in the homozygous reference calls in this interval. For example, if we only see five overlapping reads, we might not be confident that these are reference calls.

Homozygous reference quality information is critical. If we entirely elided those reference blocks, we would be unable to confidently say that the sample is homozygous reference at that locus. The sample could be either homozygous reference or “no call”. The latter is a form of missing data: we did not reliably observe the genotype at that locus for this sample.

A PVCF is always created from two or more GVCFs. The conversion from GVCF to PVCF is also lossy because the reference blocks are eliminated. This is not a problem for GWAS because the phenotype, which varies per sample, cannot correlate with a locus lacking variation.

The loss of reference blocks prevents adding new samples to an existing PVCF or combining two PVCFs. In either case, there may exist a variant at which one dataset has a variant record and the other dataset has no record at all. In that case, for samples without a record, we know only that they have either a homozygous reference call or “no call” (aka unknown genotype aka missing data).

In terms of bits-and-bytes, GVCF and PVCF are both valid VCFs though each has a different set of conventions. Most tools cannot process every valid VCF. Instead, they make assumptions about the VCF structure and only work on VCFs that satisfy those assumptions. There is a detailed specification of VCF version 4.3. There are also a number of descriptions of GVCF: Broad 1, Broad 2, and Illumina.

There is more to say about PVCFs, but let me first address PLINK and BGEN. There are actually two PLINK formats: PLINK1 (bed) and PLINK2 (pgen). I am not too familiar with either format, but I understand that they are meant to store “hard calls” meaning just the genotypes or the dosages. In particular, they cannot store sequencing metadata such as DP, AD, and PL. I believe PLINK2 supports multi-allelic sites. There are several versions of BGEN, but they are all essentially designed to store the genotype probabilities. BGEN 1.2 and later supports multi-allelic variants. Again, as far as I know, there is no support for arbitrary sequencing metadata fields.

We already noted that two PVCFs cannot be combined, but there is yet another problem: the typical representation of sequencing data in PVCFs grows super-linearly in the number of samples. In particular, at multi-allelic sites, each sample often adds another alternate allele. As a result, the AD array grows roughly linearly in the number of samples. Moreover, the PL array, which has an entry for every possible pair of alleles, grows roughly quadratically in the number of samples. The true relationship is a bit complicated, but empirically, we have observed super-linear scaling of PVCFs in samples.
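The quadratic growth of PL follows directly from the VCF genotype ordering: for n alleles (reference plus alternates) there are n(n+1)/2 possible unphased diploid genotypes, so the PL array has that many entries per sample. A quick sketch:

```python
def pl_length(n_alleles: int) -> int:
    """Number of PL entries for an unphased diploid genotype:
    one entry per unordered pair of alleles, i.e. n(n+1)/2."""
    return n_alleles * (n_alleles + 1) // 2

# A biallelic site has 3 PL entries per sample; a multi-allelic site
# that has accumulated 100 alternate alleles (101 total) has 5151.
for n in [2, 11, 101]:
    print(n, pl_length(n))  # 2 3, 11 66, 101 5151
```

So if the number of alternate alleles at a site grows roughly with the number of samples, the per-sample PL storage at that site grows quadratically.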

GLnexus mitigates the size problem by eliding the PL array for homozygous reference calls. I’m not sure what other joint-callers do. In our experience, a PVCF with real PLs of 100,000 samples is prohibitively large. So large that, as far as we know, no tool can produce such a PVCF. There is a stalled discussion about modifying the VCF spec to encode AD and PL in O(1) space in the samples.

Hail has three fundamental data types: the Table, the MatrixTable, and the BlockMatrix. A table is what it sounds like. A BlockMatrix is for linear algebra. The MatrixTable is, essentially, a totally generic version of a VCF. When you run import_vcf you get a MatrixTable. MatrixTables are intended for use with “dense” data, such as normal PVCFs.
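For example, importing a block-gzipped PVCF produces a MatrixTable. This is only a sketch; the path is a placeholder for wherever your pVCF lives:

```python
import hail as hl

hl.init()

# Import a multi-sample pVCF as a MatrixTable. force_bgz tells Hail to
# treat .gz files as block-gzipped, which is required for parallel reads.
mt = hl.import_vcf(
    'path/to/ukb.pvcf.chr21.vcf.gz',  # placeholder path
    reference_genome='GRCh38',
    force_bgz=True,
)

# Rows are variants, columns are samples, entries are genotypes.
mt.describe()
```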

On top of these fundamental data types, we have built the Variant Dataset or VDS. A VDS is combinable: one or more VDSes may be combined with one or more GVCFs to cheaply produce a new VDS. A VDS succinctly represents AD, PL, and other sequencing metadata. The size of a VDS grows linearly in the samples and variants.
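As a sketch, building a VDS from GVCFs uses the Hail VDS combiner; all paths below are placeholders, and for real data you would pass the full list of per-sample GVCFs:

```python
import hail as hl

hl.init()

# Combine per-sample GVCFs into a single Variant Dataset (VDS).
# All paths here are placeholders.
combiner = hl.vds.new_combiner(
    output_path='gs://my-bucket/dataset.vds',
    temp_path='gs://my-bucket/tmp/',
    gvcf_paths=[
        'gs://my-bucket/gvcfs/sample1.g.vcf.gz',
        'gs://my-bucket/gvcfs/sample2.g.vcf.gz',
    ],
    reference_genome='GRCh38',
    use_exome_default_intervals=True,
)
combiner.run()

vds = hl.vds.read_vds('gs://my-bucket/dataset.vds')
```

Because a VDS is combinable, the same combiner can later take this VDS plus new GVCFs to cheaply produce an extended VDS.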

Generating a VDS from exomes costs 0.01 USD per exome on Google Dataproc, using autoscaling policies. Generating a VDS from genomes costs 0.10 USD per genome.

If you do not intend to add new samples to a dataset and your dataset contains fewer than 10,000 samples, you should generate a PVCF. This representation is dense and therefore easier to use.

If someone has already produced a PVCF for you, you should use that. But be aware! Because of the issues stated above, some tools or methods (including those in Hail!) might not work correctly because fields are missing.

If your goal is a GWAS and someone you trust has already produced a set of a high-quality variants and samples, I recommend filtering the BGEN or PLINK file to those samples and using that for GWAS.

If you want to do your own QC or do something different than GWAS, then Hail might be a good tool. You should use the PVCF that already exists.
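In that case, a first pass at sample QC might look like the following sketch. The path is a placeholder and the thresholds are illustrative, not recommendations:

```python
import hail as hl

hl.init()

# Import the existing pVCF (placeholder path).
mt = hl.import_vcf('path/to/ukb.pvcf.chr21.vcf.gz',
                   reference_genome='GRCh38',
                   force_bgz=True)

# hl.sample_qc annotates each sample with call rate, mean DP/GQ,
# het/hom counts, Ti/Tv ratio, and other per-sample metrics.
mt = hl.sample_qc(mt)

# Example filter: drop samples with low call rate or low mean depth.
# These thresholds are illustrative only.
mt = mt.filter_cols((mt.sample_qc.call_rate >= 0.97) &
                    (mt.sample_qc.dp_stats.mean >= 20))

# Export the per-sample metrics for inspection.
mt.cols().select('sample_qc').export('sample_qc_metrics.tsv')
```

Start on a single chromosome (or an even smaller interval) until the pipeline behaves as expected, then scale out.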

Working in the cloud is really challenging. I have never used AWS for analysis, so my suggestions are necessarily vague. Make sure to use an auto-scaling cluster. Start small with just a handful of variants until you feel confident in your analysis. Do not convert a BGEN file to a Hail MatrixTable. BGEN is a more succinct format than Hail MatrixTable format. Hail has great support for BGEN. I wrote more details about using Hail on the cloud here.
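For BGEN specifically, Hail reads the format natively after a one-time indexing step, so there is no need to convert it to MatrixTable format on disk. A sketch, with placeholder paths:

```python
import hail as hl

hl.init()

# One-time step: build a Hail index for the BGEN file (placeholder paths).
hl.index_bgen('path/to/ukb.chr21.bgen', reference_genome='GRCh38')

# Read the BGEN directly; rows, columns, and entries behave like any
# other MatrixTable, but the data stays in BGEN format on disk.
mt = hl.import_bgen('path/to/ukb.chr21.bgen',
                    entry_fields=['GT', 'dosage'],
                    sample_file='path/to/ukb.chr21.sample')
```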

Just to be crystal clear: If you have a PVCF not partitioned by sample but partitioned by variant, that is a fine choice. I am apprehensive about a PVCF that is partitioned by both sample and variant. I have never heard of that and I suspect it will create headaches downstream.

Thank you so much for your incredibly detailed response!
Yes, there is a PVCF partitioned by variant that appears to be our best choice moving forward. We greatly appreciate the help.
