Hello,
I am supporting a project that uses All of Us genomic data, which All of Us stores as Hail MatrixTables (among other formats).
To reduce file size, the variant annotations are kept in a separate table, the Variant Annotation Table (VAT), which can be filtered with Hail functions. The result is a table with chromosome, position, allele, gene, etc., but it contains no information on individual samples.
How can I use the filtered VAT to pull out samples from the MatrixTable, which is where the genotype information actually lives? Should I build intervals from the VAT, use them to filter the MatrixTable, and then retrieve the sample IDs per variant? Or would some kind of inner join on the variant key be better?
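To make the join option concrete, here is roughly what I have in mind. This is an untested sketch and rests on assumptions I have not confirmed for the All of Us release: that the filtered VAT is a Hail Table keyed by `locus` and `alleles` (the same as the MatrixTable row key), that the MatrixTable has a `GT` entry field, and that samples are keyed by `s`. The function name `carriers_per_variant` is just something I made up for illustration.

```python
def carriers_per_variant(mt, vat_filtered):
    """Return a Hail Table with one row per variant and its carrier sample IDs.

    Assumptions (not confirmed for the All of Us data):
      - `vat_filtered` is a Hail Table keyed by (locus, alleles), matching mt's row key
      - `mt` has a `GT` call entry field and the default column key `s`
    """
    import hail as hl  # imported lazily so defining the function does not start Hail

    # Keep only the MatrixTable rows whose key also appears in the filtered VAT.
    mt_small = mt.semi_join_rows(vat_filtered)

    # For each remaining variant, collect the IDs of samples with a non-reference call.
    mt_small = mt_small.annotate_rows(
        carriers=hl.agg.filter(mt_small.GT.is_non_ref(), hl.agg.collect(mt_small.s))
    )

    # Drop the entries and keep just the variant key plus the carrier list.
    return mt_small.rows().select('carriers')


# Intended use (paths are placeholders, not real locations):
# import hail as hl
# mt = hl.read_matrix_table('<path to genotype MatrixTable>')
# vat = hl.read_table('<path to filtered VAT>')
# carriers_per_variant(mt, vat).show()
```

The interval version would presumably be `hl.filter_intervals(mt, intervals)` with a list of intervals derived from the VAT, followed by the same aggregation, but I don't know which of the two Hail handles more efficiently.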
Both options seem like very expensive and slow operations, and I’m not sure whether there is a better way short of exporting the variant annotations and filtering a VCF or PLINK file instead. I’d like to learn how to get the most out of Hail MatrixTables, and when it is best to leave this data format.
Advice would be appreciated!