I am working in the Hail exome matrix table in the All of Us biobank. Once I subset the MT for my cohort of interest, I have been exporting to VCF. This is for a couple of reasons, but mainly b/c the algorithms used by my group read the variants from a VCF. In doing this, I’ve noticed that I lose about 30K variants per sample b/w the MT and VCF. Is this a known issue, or is it that the reference genotypes in the MT (0/0) are exported as missing in the VCF (./.)? If so, is there a way to bypass this issues, such that I get the reference calls in the VCF?
Hi @jasmussen!
I’m not sure what’s going on here. Could you share the code you’re running? If it’s possible to condense to a small example that shows the issue, that would make it easiest to identify the issue, but anything is helpful.
Thank you for the response. Here is my basic approach. It is super simple -
First, I have a list of samples I want to include in the MT, so I filter columns with those IDs:
mt = mt.filter_cols(hl.literal(patient_ids).contains(mt.s))
Then, I remove variants w/o genotypes in my set of samples:
mt = hl.variant_qc(mt)
mt = mt.filter_rows(mt.variant_qc.AC[1] > 0, keep = True)
Then, I export to VCF:
output_prefix = my_bucket + ‘/outFile’
hl.export_vcf(mt, f’{output_prefix}.vcf.bgz’)
That is it. When I run bcftools stats on the outFile.vcf.bgz I get about 30K less variants then what I started with in the Hail MT. I got the Hail MT per sample SNP counts by the following command:
I got the Hail MT per sample SNP counts by the following command:
Was that run on the mt before or after filtering out variants with AC = 0?
I don’t know how bcftools computes stats, but if you reimport the vcf to a matrix table and compute the counts, do you still see a discrepency with before the export?
I got the per sample SNP counts from the MT after filtering for AC = 0 variants.
Yes, I just tested it and after I re-import the VCF as a Hail MT the samples are no longer “missing” the 30K variants.
I thought ‘sample_qc.n_snp’ counted the total number of alternative alleles in each sample. I am very confused b/c bcftools stats should also count the total number of alternative alleles per sample (nNonRefHom + nHets). I’m not sure how to explain the difference.
Okay great, so the vcf isn’t missing any data, there’s just a discrepency in how the stats are being computed.
sample_qc.n_snp counts the number of alternate alleles in snp sites. I’m guessing the “missing” 30k variants are not snps. Try this to compute the count of all alternate alleles:
Thank you for the feedback! That would make sense to me.
I was trying to run the command you provided and got an error. It ran after I include some additional parentheses:
mt.select_cols(n_alt_alleles=hl.agg.sum(mt.GT.n_alt_alleles())).cols().to_pandas()
The output is a df with the sample ID and the sum of the n_alt_alleles, which is approx 122K per sample. That is 20K more than the sample_qc.n_snp. Is that what you were expecting?
I was hoping it would arrive at the same counts that you’re getting from bcftools, to be sure there’s no discrepency in the exported vcf. It sounds like they still don’t quite agree?
Oh, I missed this! The code I shared counted the number of alternate alleles, meaning a hom non-ref genotype adds 2 to the sum. This will count the number of sites with at least one alt allele:
Alright, I didn’t get a chance to start it until late yesterday. The numbers are closer, but still don’t match bcftools. The average per sample is 89K, so more than bcftools. However, I did go back to the bcftools stats and this time I did the adjustment from that side of things. I did this –
2*nNonRefHom + nHets
With doubling the nNonRefHom count, I get the same number as the sample_qc.n_snp from Hail.
Here is a recap –
bcftools stats (nNonRefHom + nHets) = 73.6K sites with alternative alleles per sample
bcftools stats (2*nNonRefHom + nHets) = 101.9K alternative alleles per sample
Hail MT sample_qc.n_snp = 101.9K alternative alleles per sample
Hail MT n_alt_alleles = 122K alternative alleles per sample
Hail MT n_non_hom_ref = 88.6 sites that are non-hom-ref per sample
At this point, I feel good about the sample_qc.n_snp versus bcftools. Thank you so much for helping to sort it out!
Regarding the difference with Hail MT n_non_hom_ref and bcftools (nNonRefHom + nHets), is Hail counting SNVs and indels that occur at the same site? I know you mentioned this in one of your earlier posts and, I know in bcftools I’m only counting SNVs.