'GT' are present in MT but missing from VDS for same samples

akhattab · September 19, 2023, 6:55pm

Hi, hope all is well.
I’m working on the AllofUs platform to calculate some PRS. I’m using a custom code and Hail MT for calculations, which is working but is time-consuming and expensive.
A part of the code is subsetting the variants in the PRS weights df from the MT, and I came across @tpoterba post about filtering VDS using hail.vds.filter_intervals which is SUPER fast compared to MT. The issue is that most samples are missing ‘GT’ even though they are present in the MT.

My code:

PGS_sites = hl.import_locus_intervals(interval_file, reference_genome=‘GRCh38’, skip_invalid_intervals=True)
vds_prs = hl.vds.filter_intervals(vds_subset, PGS_sites, keep=True)

The number of variants from a PRS score present in MT:
download

The number of variants from the same PRS score present in VDS for the same cohort:
download

Am I missing something here? Any thoughts?

Thank you!
Ahmed

danking · September 20, 2023, 8:31pm

What code are you using to show the genotypes? vds.variant_data.GT.show() shows only the variant data. The VDS is a split representation: reference data and variant data are stored separately, each on in its own Hail Matrix Table: vds.reference_data and vds.variant_data.

If, for your analysis, you need reference information (in particular, if you need to decide if a given sample is a confident homozygous reference call instead of a no call), you need to densify the VDS. There’s some more information about this at the AoU VDS page.

I recommend you seriously consider using the smaller, pre-densified callsets unless you absolutely need genome-wide data. Densification is not a cheap process, it’s converting from a very sparse representation to a very large and dense representation. This is particularly true if you only need GTs! The majority of the VDS is quality metadata like PL and AD.

akhattab · September 21, 2023, 12:44am

Hi @danking, thank you for your response!
Yes, I used vds.variant_data.GT.show() to show the genotypes.

Does that mean an NA could either be a nocall or a homozygous reference? I queried vds.reference_data for some of the variants with NA in the vds.variant_data and they also have NA also in the reference_data

Is it safe to assume that the variants with NA are homozygous reference?

danking · September 21, 2023, 2:40pm

Does that mean an NA could either be a nocall or a homozygous reference?

Yes, that’s exactly what NA means.

The reference data is not stored row-by-row. The reference data is stored like a GVCF: there are “reference blocks” which implicitly span multiple rows.

Is it safe to assume that the variants with NA are homozygous reference?

No it is not.

If you need to confidently know if a sample is homozygous reference or no call you must use hl.vds.to_dense_mt:

mt = hl.vds.to_dense_mt(vds)
mt.GT.show()  # if you see NA here they are no calls

Note that densification is an expensive and time-consuming process! You probably want to filter to a small set of intervals-of-interest first using hl.vds.filter_intervals:

vds = hl.vds.filter_intervals(
    vds,
    hl.literal([
        hl.locus_interval('chr1', 123, 456)
    ])
)

If possible, I recommend using the pre-densified datasets because the All of Us project has already paid the cost of producing dense data at those loci!

akhattab · September 22, 2023, 5:06pm

Great! Thank you so much, Dan. I appreciate the help!

akhattab · September 27, 2023, 5:36pm

Hi @danking, sorry to keep bugging you with this but I have a follow-up question.
I’m trying to understand how Hail treats the missing calls.
When using mt.GT.show I could see the missing calls as NA:

But when using hl.is_missing, they are still NA instead of True I guess:

Same when using is_defined:

My question is, why are these calls ‘ignored’?
Thank you!

danking · October 4, 2023, 2:39pm

Hey! Sorry for the delay, I answered in your other question How does Hail treat missing calls?

Topic		Replies	Views
gVCF values in VDS reference data Hail Query & hailctl	0	29	January 22, 2025
Hail V01 to V02 missing genotypes? Hail Query & hailctl	5	607	May 9, 2019
When to densify? Hail Query & hailctl	3	267	September 22, 2023
Call Rate NAs after GT revising Hail Query & hailctl	2	291	October 28, 2021
Running polygenic risk scoring with VariantDataset Science	3	600	May 15, 2023

'GT' are present in MT but missing from VDS for same samples

Related topics