Dear Hail Team:
My colleague exported her QC results from Hail as extensively sliced VCFs (>100 TB) per study, and it took me a few months to process and merge the VCF files so that I could use them in my analysis. I have just learned that she exported LGT in the VCFs instead of GT. This means the work I have been doing is largely wasted, and as far as I can tell the only way to convert an LGT VCF to a GT VCF is through Hail functions.
My questions are:
(1) When Hail exports genotype data to VCF, does it automatically convert LGT to GT? An LGT-only VCF does not seem to follow the VCF standard.
(2) If there is no mechanism to prevent this, and LGT is exported in place of the genotype field in the VCF, is there a faster way than running Hail to convert LGT to GT?
Thanks very much, as this is critical and very time sensitive.
Does the VCF also contain an LA (local alleles) field? The LGT to GT conversion is not a proprietary Hail algorithm but a simple translation using local allele indices. If the LA field has been dropped, there is not enough information to recover GT.
Thanks Tim. In our VCF files, LA was removed along with the other FORMAT fields. The only way we can correct this error would be to download all the original VCFs that contain LA and LGT and run the conversion again, which would be very expensive to do through Hail. We still hope to convert LGT to GT on our local cluster. Do you know of any software or scripts besides Hail that can do this? Our cluster does not support Spark, so we cannot run Hail on it.
I also strongly believe that if the exported genotype data followed the VCF 4.2 standard, LA would not really be needed, since this index just records the order of the variant's alternate alleles. Is that correct?
And I believe LGT was introduced primarily for Hail to handle multi-allelic sites.
Converting from LGT to GT is trivial, but it depends on the LA. The LA encodes how local alleles map to global alleles. If you’re concerned about network traffic, perhaps your partner can extract the LA fields to a TSV first and then you can download the TSV?
The algorithm for converting LGT to GT:
- Let LGT = X/Y.
- Let LA be the local alleles array (an array of integers).
- GT is LA[X]/LA[Y].
For example:
- LGT = 1/2
- LA = [0, 2, 3]
- GT = 2/3
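The steps above can be sketched in plain Python with no Hail dependency. This is only an illustration: the function name `lgt_to_gt` and the string/list input shapes are assumptions made here, and passing missing alleles (".") and the phasing separator ("|") through unchanged is a choice of this sketch, not something stated in the thread.

```python
import re

def lgt_to_gt(lgt: str, la: list[int]) -> str:
    """Translate a local genotype call such as "1/2" into global allele indices."""
    # Split on the separator but keep it, so "/" vs "|" survives the round trip.
    parts = re.split(r"([/|])", lgt)
    out = []
    for part in parts:
        if part in ("/", "|", "."):
            out.append(part)                 # separators and missing alleles pass through
        else:
            out.append(str(la[int(part)]))   # local index -> global index via LA
    return "".join(out)

print(lgt_to_gt("1/2", [0, 2, 3]))  # 2/3
```

The lookup fails with an IndexError if LGT refers to a local index that LA does not cover, which is the same information gap that appears when LA has been dropped.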
The VCF spec does not require a GT field. The VCF format is a general purpose format for matrix-structured data whose rows are indexed by genomic variants.
I do, however, agree that the Hail library should provide a vds_to_vcf function which “does the right thing”. We’re working as quickly as we can to address all these issues, but the team is spread thin across a large project. I expect in a year’s time, there will be less opportunity for a user to make mistakes when exporting.
For future reference, if you want to create a GT field before exporting, you can do this:
import hail as hl
# Derive a standard GT entry field from LGT and LA, then export.
mt = mt.annotate_entries(
    GT = hl.vds.lgt_to_gt(mt.LGT, mt.LA)
)
hl.export_vcf(mt, 'gs://path/to/data.vcf.gz')
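For a cluster without Spark, the same translation can be applied to VCF text directly, as long as LA is still present in the FORMAT column. The sketch below is hedged accordingly: `convert_line` is a hypothetical helper, and it assumes LA was written as comma-separated integers (the usual VCF encoding for array fields) and that LGT uses "." for missing alleles. It has not been run against real Hail exports.

```python
import re

def convert_line(line: str) -> str:
    """Rewrite one VCF data line, replacing LGT with a GT computed from LA."""
    cols = line.rstrip("\n").split("\t")
    keys = cols[8].split(":")                      # FORMAT column
    i_lgt, i_la = keys.index("LGT"), keys.index("LA")
    # GT is placed first (the VCF spec requires this when GT is present); other fields are kept.
    cols[8] = ":".join(["GT"] + [k for k in keys if k != "LGT"])
    for c in range(9, len(cols)):                  # one column per sample
        vals = cols[c].split(":")
        vals += ["."] * (len(keys) - len(vals))    # VCF allows dropping trailing fields
        la = [] if vals[i_la] == "." else [int(x) for x in vals[i_la].split(",")]
        # Map every local allele index through LA; separators and "." are left untouched.
        # Raises IndexError if LGT has a called allele that LA does not cover.
        gt = re.sub(r"\d+", lambda m: str(la[int(m.group())]), vals[i_lgt])
        rest = [v for i, v in enumerate(vals) if i != i_lgt]
        cols[c] = ":".join([gt] + rest)
    return "\t".join(cols)

example = "chr1\t100\t.\tA\tAT,ATT,ATTTT\t.\t.\t.\tLGT:LA:DP\t0/1:0,3:12\t./.:.:."
print(convert_line(example))
```

Header lines (those starting with "#") would need to be passed through separately, with a GT FORMAT declaration added; that part is left out here.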
This is almost right. The LA is not a “bijection” aka reordering. It’s an “injection”. There might be 40 alleles across all samples at a given variant but at any sample there can be at most two alternate alleles. The LA converts from the 1-3 locally relevant alleles to the 40 globally relevant alleles.
E.g.:
alleles = ["A", "AT", "ATT", "ATTTT", .... "AT...T"]
And for a particular sample:
LGT = 0/1
LA = [0, 3]
In terms of nucleotides, the LGT in this example is “A/ATTTT”
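The same example written out in Python, as a small illustration. Only the first four alleles from the list above are used; the elided ones are left out, and the helper name `render` is made up for this sketch.

```python
alleles = ["A", "AT", "ATT", "ATTTT"]  # first four global alleles only
lgt = [0, 1]                           # local genotype 0/1
la = [0, 3]                            # local index -> global index

def render(lgt, la, alleles):
    """Return the global genotype and its nucleotide form for one sample."""
    gt = [la[i] for i in lgt]          # look each local index up in LA
    return "/".join(str(g) for g in gt), "/".join(alleles[g] for g in gt)

print(render(lgt, la, alleles))  # ('0/3', 'A/ATTTT')
```

Note that LA here has two entries while the site has many more alleles, which is why the mapping cannot be inverted or guessed from the global allele list alone.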
This seems silly for GTs but is critical for sparse representations of PL and AD. We opted for a uniformly local representation (LGT, LPL, LAD), but that does mean you have to convert from LGT to GT. There are more details in this Google doc and the companion slides.
I empathize with the difficulty of a new representation. Unfortunately, this is necessary to avoid the size explosions described in those documents with the naive representations of PL and AD.
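To give a rough sense of the size difference, here is a back-of-the-envelope count. It relies on the standard VCF conventions that PL carries one value per possible diploid genotype, n(n+1)/2 for n alleles, and AD one value per allele; the 40-allele site is the hypothetical from earlier in the thread, and the function name is invented for this sketch.

```python
def n_diploid_genotypes(n_alleles: int) -> int:
    """Number of unordered diploid genotypes, i.e. PL entries per sample."""
    return n_alleles * (n_alleles + 1) // 2

for n in (2, 3, 40):
    # AD needs n entries per sample, PL needs n(n+1)/2.
    print(n, n_diploid_genotypes(n))
# 2 3
# 3 6
# 40 820
```

So a sample with at most three locally relevant alleles needs at most 6 PL values in the local representation, against 820 per sample if every sample had to carry values for all 40 alleles.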
Thanks Dr. King for your explanation and help.
I have been working on a university cluster that does not support Spark, so we cannot run Hail. To save money, we would also like to run the LGT to GT conversion directly on a cluster that does not have Hail available.
Do you have scripts or tools that I can use to convert an LGT VCF into a normal VCF with GT?
Thanks very much indeed.