Appending to an existing MatrixTable


I have a problem and wanted to ask the community what the ideal solution would be.

  1. I have an existing MatrixTable (MT) with, say, 5000 VCF files imported into it.
  2. I would like to append another VCF file to that MT.
  3. Run an analysis, say a regression, on the merged dataset and save it to disk.
  4. Repeat #2 and #3 many times.

The dataset will grow over time as each subsequent VCF file is appended to the MT.

Would a union be an option, or is there a more efficient solution? Thanks in advance!

To answer this, we need more information about the input VCFs. Are they project VCFs (with FORMAT fields such as GT, GQ, etc.) for a group of samples from sequencing data? Those cannot be losslessly combined, because a site might appear in one VCF but not in another.

If your VCFs are genotyping data, it’s probably possible to combine them, since those have the same set of variants. In that case I think union_cols is probably what you want.
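For illustration, a minimal sketch of that union_cols step might look like the following. The function name, the paths, and the GRCh37 default are assumptions made for the example, not details from this thread, and it presumes the new VCF has the same column and entry fields as the existing MT.

```python
def union_new_sample(mt_path: str, vcf_path: str, reference_genome: str = "GRCh37"):
    """Return the existing MatrixTable with the VCF's samples added as columns.

    Hypothetical helper: names, paths, and the reference genome are placeholders.
    """
    # Imported lazily so this module can be loaded where Hail is not installed.
    import hail as hl

    existing = hl.read_matrix_table(mt_path)
    new = hl.import_vcf(vcf_path, reference_genome=reference_genome)

    # union_cols joins on the row key (locus, alleles). With the default
    # inner join, variants missing from either dataset are dropped, which is
    # why this only suits inputs that share the same set of variants.
    return existing.union_cols(new)
```

Nothing is written to disk here; the result is a lazy MatrixTable that can be passed to an analysis or written to a new location.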

Thanks @johnc1231. The VCFs are from genotyping. I tried implementing union_cols, but Hail does not permit overwriting the existing MT (I get an error that the input and output are the same). I am now creating a temporary MT, which is the union of the old MT and the new VCF's MT, every time a new VCF needs to be imported.

I was wondering if there is a more efficient solution out there.
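One way to organize the temporary-MT workaround described above is to write each merged result to a fresh, versioned path instead of the path being read. This is only a sketch: the `.vN.mt` naming scheme and both function names are made up for the example, and the GRCh37 default is an assumption.

```python
import re


def next_version_path(current_path: str) -> str:
    """Map 'cohort.v3.mt' to 'cohort.v4.mt'; unversioned paths get '.v1.mt'."""
    match = re.search(r"\.v(\d+)\.mt$", current_path)
    if match:
        version = int(match.group(1)) + 1
        return f"{current_path[:match.start()]}.v{version}.mt"
    base = current_path[:-3] if current_path.endswith(".mt") else current_path
    return f"{base}.v1.mt"


def append_and_write(current_path: str, vcf_path: str,
                     reference_genome: str = "GRCh37") -> str:
    """Union one VCF into the MT at current_path and write to a new path."""
    # Imported lazily so the path helper works without Hail installed.
    import hail as hl

    merged = hl.read_matrix_table(current_path).union_cols(
        hl.import_vcf(vcf_path, reference_genome=reference_genome)
    )
    out_path = next_version_path(current_path)
    # The output differs from the input, so nothing is overwritten in place.
    merged.write(out_path)
    return out_path
```

The caller keeps the returned path as the new "current" MT and can delete older versions once they are no longer needed.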

It’s possible that once the MT is large (~100s of thousands of samples), writing for each new VCF is going to be the dominating expense. In that case, you might consider not writing each time, but instead building a sort of log-structured merge tree where you only write the merged dataset when you hit some threshold, and do the regressions by joining with union_cols when the new samples come in.

I think the simple solution you’ve proposed will probably scale to half a million samples reasonably well, though.
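To make the threshold idea concrete, here is a rough, library-agnostic sketch of the bookkeeping. The class name and the `merge`/`persist` callables are hypothetical. With Hail, `merge` could be something like `lambda a, b: a.union_cols(b)` and `persist` a function that writes the MT to a fresh path and reads it back.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")


class BufferedCohort(Generic[T]):
    """Keep a persisted base plus unpersisted batches; write only at a threshold."""

    def __init__(self, base: T, merge: Callable[[T, T], T],
                 persist: Callable[[T], T], threshold: int) -> None:
        self.base = base
        self.pending: List[T] = []   # new batches not yet written to disk
        self.merge = merge
        self.persist = persist
        self.threshold = threshold
        self.writes = 0              # number of (expensive) persist calls so far

    def view(self) -> T:
        """Base merged with all pending batches, for running the analysis."""
        merged = self.base
        for batch in self.pending:
            merged = self.merge(merged, batch)
        return merged

    def compact(self) -> None:
        """Persist the merged dataset and make it the new base."""
        self.base = self.persist(self.view())
        self.pending.clear()
        self.writes += 1

    def add(self, batch: T) -> None:
        """Buffer a new batch; compact once the threshold is reached."""
        self.pending.append(batch)
        if len(self.pending) >= self.threshold:
            self.compact()
```

Each regression would run on `view()`, so the cost of a full write is paid once per `threshold` new VCFs rather than once per VCF.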

Thanks @tpoterba. I am on a 4-core CPU with 8 GB of RAM, and it has taken me ~4 hours to do a union of 100 VCFs so far. Is this normal?

Is each VCF a single sample, or a batch of samples?

Each VCF is a single sample.

Are they gVCFs?

They are not gVCFs.