Outer Joining Multiple VCFs

Hello!
I wanted to see if there’s a simple way to outer join multiple VCFs in Hail. I have ~50,000 VCFs, each of which has somatic calls from 1 individual. My goal is to perform QC on each VCF (e.g., removing non-PASS variants and low-quality genotypes), join all 50,000 VCFs into 1 Hail .mt (I don’t need any entry-level info on each variant except for GT and VAF), VEP-annotate that .mt, and run export_entries for downstream analysis.

Would be fantastic if there was a way to do this all within hail.
Hope to hear from you soon!
Thanks,
Maryam

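The solution here will involve a hierarchical merge using union_cols(..., row_join_type='outer'): merge the VCFs in small groups, then merge those results, rather than chaining all 50,000 joins one after another.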
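A minimal sketch of what such a hierarchical merge could look like, not a tested pipeline. The helper names (`hierarchical_merge`, `outer_union_cols`, `merge_vcfs`), the `fan_in` parameter, and the choice to keep only `GT` are assumptions made for illustration; the only Hail calls used are `hl.import_vcf`, `MatrixTable.select_entries`, and `MatrixTable.union_cols`. Per-VCF QC and the VAF field are left out because they depend on how your caller writes FILTER and FORMAT.

```python
from functools import reduce
from typing import Callable, List, TypeVar

T = TypeVar("T")


def hierarchical_merge(items: List[T], combine: Callable[[T, T], T], fan_in: int = 2) -> T:
    """Merge items in rounds: each round combines groups of `fan_in` neighbours,
    so the depth of the merge tree grows logarithmically with len(items)."""
    if not items:
        raise ValueError("nothing to merge")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(items)
    while len(level) > 1:
        level = [
            reduce(combine, level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return level[0]


def outer_union_cols(left, right):
    # row_join_type='outer' keeps variants present in either input;
    # samples without a call at a variant get missing entries.
    return left.union_cols(right, row_join_type='outer')


def merge_vcfs(paths: List[str], fan_in: int = 16):
    """Hypothetical driver: import each single-sample VCF, keep GT only, merge."""
    import hail as hl  # imported here so the helpers above can be tried without Hail

    def load(path):
        mt = hl.import_vcf(path)      # uses the default reference genome
        return mt.select_entries('GT')  # union_cols needs matching entry schemas

    return hierarchical_merge([load(p) for p in paths], outer_union_cols, fan_in)
```

With 50,000 inputs you would probably also want to write each intermediate result to disk (for example with `MatrixTable.checkpoint`) between rounds instead of building one very large lazy query, but how to batch that depends on your cluster and storage.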

I’ve written a function to do the basics here (it was designed for non-genotype data, but hopefully it should be generic enough): https://github.com/Nealelab/ukb_common/blob/master/utils/results_loading.py#L105