Exporting Hail MT to VCF - Missing Genotypes

jasmussen · May 6, 2024, 5:16pm

Hi –

I am working in the Hail exome matrix table in the All of Us biobank. Once I subset the MT for my cohort of interest, I have been exporting to VCF. This is for a couple of reasons, but mainly b/c the algorithms used by my group read the variants from a VCF. In doing this, I’ve noticed that I lose about 30K variants per sample b/w the MT and VCF. Is this a known issue, or is it that the reference genotypes in the MT (0/0) are exported as missing in the VCF (./.)? If so, is there a way to bypass this issues, such that I get the reference calls in the VCF?

Thanks!

patrick-schultz · May 6, 2024, 7:14pm

Hi @jasmussen!
I’m not sure what’s going on here. Could you share the code you’re running? If it’s possible to condense to a small example that shows the issue, that would make it easiest to identify the issue, but anything is helpful.

jasmussen · May 6, 2024, 7:44pm

Hi @patrick-schultz!

Thank you for the response. Here is my basic approach. It is super simple -

First, I have a list of samples I want to include in the MT, so I filter columns with those IDs:
mt = mt.filter_cols(hl.literal(patient_ids).contains(mt.s))

Then, I remove variants w/o genotypes in my set of samples:
mt = hl.variant_qc(mt)
mt = mt.filter_rows(mt.variant_qc.AC[1] > 0, keep = True)

Then, I export to VCF:
output_prefix = my_bucket + ‘/outFile’
hl.export_vcf(mt, f’{output_prefix}.vcf.bgz’)

That is it. When I run bcftools stats on the outFile.vcf.bgz I get about 30K less variants then what I started with in the Hail MT. I got the Hail MT per sample SNP counts by the following command:

mt = hl.sample_qc(mt)
ol_df = mt.cols().to_pandas()

Please let me know if you need anything else!

patrick-schultz · May 6, 2024, 8:03pm

I got the Hail MT per sample SNP counts by the following command:

Was that run on the mt before or after filtering out variants with AC = 0?

I don’t know how bcftools computes stats, but if you reimport the vcf to a matrix table and compute the counts, do you still see a discrepency with before the export?

jasmussen · May 7, 2024, 2:56am

I got the per sample SNP counts from the MT after filtering for AC = 0 variants.

Yes, I just tested it and after I re-import the VCF as a Hail MT the samples are no longer “missing” the 30K variants.

I thought ‘sample_qc.n_snp’ counted the total number of alternative alleles in each sample. I am very confused b/c bcftools stats should also count the total number of alternative alleles per sample (nNonRefHom + nHets). I’m not sure how to explain the difference.

patrick-schultz · May 7, 2024, 11:54am

Okay great, so the vcf isn’t missing any data, there’s just a discrepency in how the stats are being computed.

sample_qc.n_snp counts the number of alternate alleles in snp sites. I’m guessing the “missing” 30k variants are not snps. Try this to compute the count of all alternate alleles:

mt.select_cols(n_alt_alleles=hl.agg.sum(mt.GT.n_alt_alleles)).cols().to_pandas()

jasmussen · May 7, 2024, 5:14pm

Thank you for the feedback! That would make sense to me.

I was trying to run the command you provided and got an error. It ran after I include some additional parentheses:
mt.select_cols(n_alt_alleles=hl.agg.sum(mt.GT.n_alt_alleles())).cols().to_pandas()

The output is a df with the sample ID and the sum of the n_alt_alleles, which is approx 122K per sample. That is 20K more than the sample_qc.n_snp. Is that what you were expecting?

patrick-schultz · May 7, 2024, 5:20pm

I was hoping it would arrive at the same counts that you’re getting from bcftools, to be sure there’s no discrepency in the exported vcf. It sounds like they still don’t quite agree?

jasmussen · May 7, 2024, 5:21pm

Correct, they still don’t agree!

VCF bcftools stats = ~70K SNPs per sample
Hail MT sample_qc stats = ~100K SNPs per sample
Hail MT n_alt_alleles = ~122K SNPs per sample

patrick-schultz · May 7, 2024, 5:41pm

Oh, I missed this! The code I shared counted the number of alternate alleles, meaning a hom non-ref genotype adds 2 to the sum. This will count the number of sites with at least one alt allele:

mt.select_cols(n_non_hom_ref=hl.agg.count(mt.GT.is_non_ref())).cols().to_pandas()

Hopefully that agrees with bcftools!

jasmussen · May 8, 2024, 3:22pm

Alright, I didn’t get a chance to start it until late yesterday. The numbers are closer, but still don’t match bcftools. The average per sample is 89K, so more than bcftools. However, I did go back to the bcftools stats and this time I did the adjustment from that side of things. I did this –
2*nNonRefHom + nHets
With doubling the nNonRefHom count, I get the same number as the sample_qc.n_snp from Hail.

Here is a recap –
bcftools stats (nNonRefHom + nHets) = 73.6K sites with alternative alleles per sample
bcftools stats (2*nNonRefHom + nHets) = 101.9K alternative alleles per sample
Hail MT sample_qc.n_snp = 101.9K alternative alleles per sample
Hail MT n_alt_alleles = 122K alternative alleles per sample
Hail MT n_non_hom_ref = 88.6 sites that are non-hom-ref per sample

At this point, I feel good about the sample_qc.n_snp versus bcftools. Thank you so much for helping to sort it out!

Regarding the difference with Hail MT n_non_hom_ref and bcftools (nNonRefHom + nHets), is Hail counting SNVs and indels that occur at the same site? I know you mentioned this in one of your earlier posts and, I know in bcftools I’m only counting SNVs.

patrick-schultz · May 8, 2024, 3:29pm

Great, glad we could get to the bottom of it!

Right, the code snippets I shared are counting all variants, not just SNVs. I didn’t realize bcftools only counts SNVs, but it all makes sense now!

Topic		Replies	Views
Gentoype filtering - missing hom ref data Hail Query & hailctl	2	27	March 20, 2025
Hail sample_qc results Hail Query & hailctl	15	449	September 7, 2022
Hail V01 to V02 missing genotypes? Hail Query & hailctl	5	606	May 9, 2019
Exporting Hail MT to VCF - FORMAT columns missing Hail Query & hailctl	3	53	March 31, 2025
Genotype matrix in hail 0.2 Hail Query & hailctl	5	751	April 15, 2019

Exporting Hail MT to VCF - Missing Genotypes

Related topics