Counting Rows More Quickly in VDS

I am presently working with Hail on the All of Us dataset, doing QC on variants in a VDS file. I have figured out the code I need for at least some of the QC and for moving towards producing bgen files (so I can work on them in PLINK). A major roadblock is counting the number of genetic variants removed after each QC step. I have code to do this, but it takes too long to be useful.

This works well for counting variants in a dataset I produced after splitting multi-allelics, running in a couple of seconds:
print("Num variants total (after split):", hl.eval(vds_split.reference_data.count_rows()))

I then remove a number of samples from the VDS using hl.vds.filter_samples. When I run this same line of code on the resulting object, I get a progress bar that moves slowly, suggesting I have 8 cores working through 84,648 partitions:
[Stage 0:> (1 + 8) / 84648]

I am hoping to get others’ advice on how to navigate this. Perhaps I just need to remove samples downstream in PLINK once I get to bgen files, but if my code can be improved, I would appreciate guidance!

Looking at other posts, it seems this is due to Hail’s lazy, “recipe”-style execution. A further challenge with my data is that I am working with 1 billion SNPs, and I will be doing a series of QC steps to remove lower-quality variants. Any advice on doing this most efficiently (i.e., filtering a MatrixTable by QC scores) would be appreciated!

@jwillett , it’s a bit hard to give advice in the abstract. Can you share the script you’re running which demonstrates your analysis? We can advise directly on how to achieve the same results with less compute.

That said, sequencing data with all its quality metadata is really, really big! This analysis will necessarily be expensive, but it need not be time-consuming! Hail was designed from the ground up to grant genetics the power of massive scaling. It looks to me like you’re only using 8 cores:

[Stage 0:> (1 + 8) / 84648]

When working with sequencing datasets as large as AoU, you have to use thousands of cores in parallel. I’m not too familiar with the AoU RWB. In Google Dataproc, you can set an autoscaling policy so that your cluster automatically shrinks and grows in response to the amount of work. I recommend scaling up to at least 500 spot instances. I want to reiterate: use spot instances. Do not use non-spot (also known as non-preemptible) instances! Those are five times as expensive. Hail has been carefully designed such that most operations (anything that doesn’t use key_rows_by or perform matrix multiplications) are tolerant of spot instances. Here’s an autoscaling policy we use internally for some of our work:

  basicAlgorithm:
    cooldownPeriod: 120s
    yarnConfig:
      gracefulDecommissionTimeout: 120s
      scaleDownFactor: 1.0
      scaleUpFactor: 1.0
  secondaryWorkerConfig:
    maxInstances: 500
    weight: 1
  workerConfig:
    maxInstances: 2
    minInstances: 2
    weight: 1

Thanks for the worker guidance; I can increase them as needed, at least for intermediate steps. I have noted that Hail 0.2 no longer has the filter_variant_expr function for VDS files. Since the MatrixTable will take more time to process after running “to_dense_mt”, is there a way to remove, directly from a VDS, the variants flagged below? (These counts come from code run after isolating variants on a single chromosome.)

{frozenset(): 19975203,
 frozenset({'ExcessHet'}): 18057,
 frozenset({'ExcessHet', 'NO_HQ_GENOTYPES'}): 11,
 frozenset({'LowQual'}): 72251,
 frozenset({'LowQual', 'NO_HQ_GENOTYPES'}): 420139,
 frozenset({'NO_HQ_GENOTYPES'}): 590664}

Where are you seeing verbiage about filter_variant_expr? That is from Hail 0.1, which is nearly five years old now. I want to scrub the Internet of any references to it. The “Variant Dataset” in Hail 0.1 has nothing to do with the Hail VariantDataset class in 0.2. It is an unfortunate name collision that I probably should have fought harder against.

The VariantDataset class has filter_variants for filtering variants.
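A rough sketch of that (untested pseudocode, assuming the AoU variant data exposes the VCF filters as a row field named filters, as the counts above suggest):

```
vd = vds.variant_data
# Keep only unflagged variants, i.e. those whose `filters` set is empty.
passing = vd.filter_rows(hl.len(vd.filters) == 0).rows()
vds_pass = hl.vds.filter_variants(vds, passing, keep=True)
```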

Aside: I strongly recommend against running your code per-chromosome. Hail is designed to process, massively in parallel, the whole dataset at once.

Can you describe, maybe in pseudo code, what you’re trying to do? We can better help suggest options if you have some examples of things that you think should work but don’t.

The Hail 0.1 code was from ChatGPT. I will avoid running code per-chromosome, thanks for the tip.

I am ultimately aiming to remove all flagged variants (per the filter field). I then want to do “variant_qc”, and then filter out variants not meeting additional metrics. I will also filter out around 550 samples using “filter_samples” but I believe my code is working for that.


Alright, that seems reasonable and filter_variants should do the trick. I recommend reading the data model explainer for VDS. The key thing to keep in mind is that the variant data doesn’t hold any reference information and you can’t assume that every sample without a variant call is homref: some could be low quality “no calls”.
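To make that concrete, here is a toy pure-Python illustration (not Hail; the data structures are invented) of why a sample absent from the variant data is not necessarily hom-ref:

```python
# Toy model of the sparse VDS idea: variant_data stores only non-ref calls,
# reference_data covers hom-ref blocks, and anything in neither is a no call.
variant_calls = {"S1": "0/1"}        # only samples with a variant call stored
ref_blocks = {"S2": "0/0"}           # S2 is covered by a reference block here
all_samples = ["S1", "S2", "S3"]     # S3 has no data at this locus

def genotype(sample):
    if sample in variant_calls:
        return variant_calls[sample]
    if sample in ref_blocks:
        return ref_blocks[sample]
    return "./."  # no call: low quality or uncovered, NOT hom-ref

print([genotype(s) for s in all_samples])  # ['0/1', '0/0', './.']
```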

Converting from the sparse VDS representation to the dense MatrixTable representation (which is what to_dense_mt does) is expensive. If it’s possible to start your analysis from the smaller, dense, pre-filtered datasets you’ll have a much easier time (and you can use PLINK without any Hail at all!) because AoU already did the expensive densification for you.

How about converting to the merged_sparse_mt? There is unfortunately no way around this conversion for my analyses.

You probably don’t want that representation. It preserves the VDS/SVCR sparse representation, but uses a single underlying matrix table rather than two. We developed that format for gnomAD v3 before we discovered/developed the VDS format. It’s mostly useful for the gnomAD team, who already have code written against that format. It is also a bit less efficient to analyze than the VDS format.

to_dense_mt is the only way to get a dense matrix of genotypes.

Great, thanks!

Could you please explain why you would not want to do it by chromosome? I saw another post suggesting that the cost scales with the number of variants, so theoretically the cost of doing each chromosome separately would be the same as doing them all together.

@jwillett Sure! The high-level answer is that it saves you unnecessary work and glue code with basically no downside.

The time (ergo cost of core-hours) should be roughly the same. The bulk of the cost is processing the genotype data or variant metadata and that work is indeed cleanly partition-able.

I recommend against writing per-chromosome code because, in every case I have seen thus far, people run exactly the same code for every autosome. If the code is the same, there’s no benefit to separately processing each chromosome, but there are costs:

  1. Each execution requires some “driver-side” time to orchestrate this work. You have to wait for this 22 times instead of once.
  2. For operations which aggregate across the entire dataset, you must manually combine the per-chromosome results. This is a source of bugs, particularly as your aggregations become more complex. For a simple example, consider counting, per sample, the number of hets, hom-refs, and hom-alts. The aggregate_cols command will produce a dictionary for each sample. You have to write glue code which loops over the list of per-chromosome dictionaries and sums each sample’s dictionary across chromosomes. That code already exists inside Hail and has been extensively tested.
  3. Similarly to (2), if you want to use variants from many chromosomes as the input to a PCA or other linear algebraic operation, you have to write glue code to stitch together the 22 autosomes. If your dataset is stored as one matrix table, you can just use the glue code that already exists inside Hail.
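The glue code in (2) can be sketched in plain Python (no Hail here; the per-chromosome counts are invented for illustration) to show what you would have to maintain yourself:

```python
from collections import Counter

# Hypothetical per-chromosome call counts for two samples; in practice these
# would come from 22 separate Hail aggregations, one per autosome.
per_chrom_counts = [
    {"S1": {"het": 10, "hom_ref": 90}, "S2": {"het": 5, "hom_ref": 95}},  # chr1
    {"S1": {"het": 7, "hom_ref": 93},  "S2": {"het": 8, "hom_ref": 92}},  # chr2
]

def combine(results):
    """Sum each sample's per-chromosome dictionaries into genome-wide totals."""
    totals = {}
    for chrom_result in results:
        for sample, counts in chrom_result.items():
            totals.setdefault(sample, Counter()).update(counts)
    return {sample: dict(c) for sample, c in totals.items()}

print(combine(per_chrom_counts))
# {'S1': {'het': 17, 'hom_ref': 183}, 'S2': {'het': 13, 'hom_ref': 187}}
```

Running the whole dataset through one Hail aggregation avoids writing, and debugging, this kind of code.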

Finally, if you ever need to do per-chromosome work, it’s still trivial to do that on a combined matrix table:

for chrom in chromosomes:
    chr_mt = mt.filter_rows(mt.locus.contig == chrom)

Hail recognizes that you’re filtering on the primary key of the dataset (the locus and the alleles). It generates a query plan which will only read the partitions that contain data from this chromosome. The runtime & cost should be indistinguishable from reading from a single chromosome matrix table.

I’ll be doing GWAS, ultimately merging bgen files for this. I’ll proceed with this then for all chromosomes. I ran it on a single chromosome thus far to validate my pipeline. Thanks for answering all my questions!
