@jwillett Sure! The high-level answer is that it saves you unnecessary work and glue code with basically no downside.
The runtime (and therefore the cost in core-hours) should be roughly the same either way. The bulk of the cost is processing the genotype data or variant metadata, and that work is indeed cleanly partitionable.
I recommend against writing per-chromosome code because, in every case I have seen thus far, people run exactly the same code for every autosome. If the code is the same, there’s no benefit to processing each chromosome separately, but there are costs:
- Each execution requires some “driver-side” time to orchestrate this work. You have to wait for this 22 times instead of once.
- For operations that aggregate across the entire dataset, you must manually combine the per-chromosome results. This is a source of bugs, particularly as your aggregations become more complex. For a simple example, consider counting, per sample, the number of hets, hom-refs, and hom-alts. An `aggregate_cols` call produces a dictionary for each sample, so with 22 separate datasets you have to write glue code that loops over the list of lists of dictionaries and sums each sample’s dictionary across chromosomes. That code already exists inside Hail and has been extensively tested (see the sketch after this list).
- Similarly to the previous point, if you want to use variants from many chromosomes as the input to a PCA or another linear-algebraic operation, you have to write glue code to stitch the 22 autosomes together. If your dataset is stored as one matrix table, you can just use the glue code that already exists inside Hail.
Finally, if you ever need to do per-chromosome work, it’s still trivial to do that on a combined matrix table:
```python
chromosomes = [str(i) for i in range(1, 23)]  # GRCh37 autosome names; use 'chr1'..'chr22' on GRCh38
for chromosome in chromosomes:
    chromosome_mt = mt.filter_rows(mt.locus.contig == chromosome)
```
Hail recognizes that you’re filtering on the primary key of the dataset (the locus and the alleles) and generates a query plan that reads only the partitions containing data from that chromosome. The runtime and cost should be indistinguishable from reading a single-chromosome matrix table.
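If you’d rather make that partition pruning explicit, `hl.filter_intervals` does the same thing; a minimal sketch, again assuming GRCh37 contig names:

```python
import hail as hl

# read only the partitions overlapping chromosome 22
# (assumes GRCh37 contig names; use 'chr22' for GRCh38)
chr22_mt = hl.filter_intervals(mt, [hl.parse_locus_interval('22')])
```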