@jwillett Sure! The high-level answer is that it saves you unnecessary work and glue code with basically no downside.
The runtime (and therefore the cost in core-hours) should be roughly the same either way. The bulk of the cost is processing the genotype data or variant metadata, and that work is indeed cleanly partitionable.
I recommend against writing per-chromosome code because, in every case I have seen thus far, people run exactly the same code for every autosome. If the code is the same, there’s no benefit to processing each chromosome separately, but there are costs:
- Each execution requires some “driver-side” time to orchestrate this work. You have to wait for this 22 times instead of once.
- For operations that aggregate across the entire dataset, you must manually combine the per-chromosome results. This is a source of bugs, particularly as your aggregations become more complex. For a simple example, consider counting, per sample, the number of hets, hom-refs, and hom-alts. An `aggregate_cols` call produces a dictionary for each sample, so with 22 separate datasets you have to write glue code that loops over the list of lists of dictionaries and sums each sample’s dictionary across chromosomes. That code already exists inside Hail and has been extensively tested (see the sketch after this list).
- Similarly to the previous point, if you want to use variants from many chromosomes as the input to a PCA or another linear-algebraic operation, you have to write glue code to stitch the 22 autosomes together. If your dataset is stored as one matrix table, you can just use the glue code that already exists inside Hail.
Finally, if you ever need to do per-chromosome work, it’s still trivial to do that on a combined matrix table:
```python
chromosomes = [str(i) for i in range(1, 23)]  # GRCh37 autosome names; use 'chr1'..'chr22' on GRCh38
for chromosome in chromosomes:
    chromosome_mt = mt.filter_rows(mt.locus.contig == chromosome)
```
Hail recognizes that you’re filtering on the primary key of the dataset (the locus and the alleles) and generates a query plan that reads only the partitions containing data from that chromosome. The runtime and cost should be indistinguishable from reading a single-chromosome matrix table.
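If you’d rather make that partition pruning explicit, `hl.filter_intervals` does the same thing; a minimal sketch, again assuming GRCh37 contig names:

```python
import hail as hl

# read only the partitions overlapping chromosome 22
# (assumes GRCh37 contig names; use 'chr22' for GRCh38)
chr22_mt = hl.filter_intervals(mt, [hl.parse_locus_interval('22')])
```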