'GT' are present in MT but missing from VDS for same samples

Hi, hope all is well.
I’m working on the AllofUs platform to calculate some PRS. I’m using a custom code and Hail MT for calculations, which is working but is time-consuming and expensive.
A part of the code is subsetting the variants in the PRS weights df from the MT, and I came across @tpoterba post about filtering VDS using hail.vds.filter_intervals which is SUPER fast compared to MT. The issue is that most samples are missing ‘GT’ even though they are present in the MT.
image

My code:

PGS_sites = hl.import_locus_intervals(interval_file, reference_genome=‘GRCh38’, skip_invalid_intervals=True)
vds_prs = hl.vds.filter_intervals(vds_subset, PGS_sites, keep=True)

The number of variants from a PRS score present in MT:
download

The number of variants from the same PRS score present in VDS for the same cohort:
download

Am I missing something here? Any thoughts?

Thank you!
Ahmed

What code are you using to show the genotypes? vds.variant_data.GT.show() shows only the variant data. The VDS is a split representation: reference data and variant data are stored separately, each on in its own Hail Matrix Table: vds.reference_data and vds.variant_data.

If, for your analysis, you need reference information (in particular, if you need to decide if a given sample is a confident homozygous reference call instead of a no call), you need to densify the VDS. There’s some more information about this at the AoU VDS page.

I recommend you seriously consider using the smaller, pre-densified callsets unless you absolutely need genome-wide data. Densification is not a cheap process, it’s converting from a very sparse representation to a very large and dense representation. This is particularly true if you only need GTs! The majority of the VDS is quality metadata like PL and AD.

Hi @danking, thank you for your response!
Yes, I used vds.variant_data.GT.show() to show the genotypes.

Does that mean an NA could either be a nocall or a homozygous reference? I queried vds.reference_data for some of the variants with NA in the vds.variant_data and they also have NA also in the reference_data
image

image

Is it safe to assume that the variants with NA are homozygous reference?

Does that mean an NA could either be a nocall or a homozygous reference?

Yes, that’s exactly what NA means.

The reference data is not stored row-by-row. The reference data is stored like a GVCF: there are “reference blocks” which implicitly span multiple rows.

Is it safe to assume that the variants with NA are homozygous reference?

No it is not.

If you need to confidently know if a sample is homozygous reference or no call you must use hl.vds.to_dense_mt:

mt = hl.vds.to_dense_mt(vds)
mt.GT.show()  # if you see NA here they are no calls

Note that densification is an expensive and time-consuming process! You probably want to filter to a small set of intervals-of-interest first using hl.vds.filter_intervals:

vds = hl.vds.filter_intervals(
    vds,
    hl.literal([
        hl.locus_interval('chr1', 123, 456)
    ])
)

If possible, I recommend using the pre-densified datasets because the All of Us project has already paid the cost of producing dense data at those loci!

1 Like

Great! Thank you so much, Dan. I appreciate the help!

Hi @danking, sorry to keep bugging you with this but I have a follow-up question.
I’m trying to understand how Hail treats the missing calls.
When using mt.GT.show I could see the missing calls as NA:

image

But when using hl.is_missing, they are still NA instead of True I guess:
image

Same when using is_defined:
image

My question is, why are these calls ‘ignored’?
Thank you!

Hey! Sorry for the delay, I answered in your other question How does Hail treat missing calls?