Variant Annotation Table Merge?

Hello,

I am supporting a project that uses All of Us genomic data. All of Us has opted to store its data in Hail matrix tables (among other formats).

To reduce file size, they have split the variant annotations into a separate table (the Variant Annotation Table, or VAT) that can be filtered with Hail functions. The result is a table with chromosome, position, allele, gene, etc., but no information on individual samples.

How can I use this table to pull out samples from the matrix table that actually contains the genotype information? Should I create intervals from the VAT, use them to filter the genotype table, and then retrieve the sample IDs per variant? Or perhaps use some kind of inner join?

This seems like a very expensive and slow operation. Is there a better way, short of exporting the variant annotations and filtering a VCF or PLINK file instead? I’d like to learn how to get the most out of Hail matrix tables, and when it is best to leave this data format.

Advice would be appreciated!

Hi @Cecile_Avery,

Sorry for the slow response! I can try to help, but I’ll need more details about what you’d like to do. Do you want to filter the VAT to some variants, then extract all samples carrying any of those variants?

The full AoU dataset is very large, and you can expect any operation that needs to read the genotype data to be relatively slow and expensive. But at this scale Hail really is the best option. If your goal is to extract a small subset of the data, then converting to another format and using more familiar tools may well be a good option.

Hi @Cecile_Avery and @patrick-schultz, I think I’m running into a similar problem. I am trying to merge demographic information with filtered variants from the VAT. Were you able to solve this issue?