Hi, hope all is well.
I’m working on the AllofUs platform to calculate some PRS. I’m using a custom code and Hail MT for calculations, which is working but is time-consuming and expensive.
A part of the code is subsetting the variants in the PRS weights df from the MT, and I came across @tpoterba post about filtering VDS using hail.vds.filter_intervals which is SUPER fast compared to MT. The issue is that most samples are missing ‘GT’ even though they are present in the MT.
What code are you using to show the genotypes? vds.variant_data.GT.show() shows only the variant data. The VDS is a split representation: reference data and variant data are stored separately, each on in its own Hail Matrix Table: vds.reference_data and vds.variant_data.
If, for your analysis, you need reference information (in particular, if you need to decide if a given sample is a confident homozygous reference call instead of a no call), you need to densify the VDS. There’s some more information about this at the AoU VDS page.
I recommend you seriously consider using the smaller, pre-densified callsets unless you absolutely need genome-wide data. Densification is not a cheap process, it’s converting from a very sparse representation to a very large and dense representation. This is particularly true if you only need GTs! The majority of the VDS is quality metadata like PL and AD.
Hi @danking, thank you for your response!
Yes, I used vds.variant_data.GT.show() to show the genotypes.
Does that mean an NA could either be a nocall or a homozygous reference? I queried vds.reference_data for some of the variants with NA in the vds.variant_data and they also have NA also in the reference_data
Is it safe to assume that the variants with NA are homozygous reference?