Support for phased genotypes

gdelangel · October 24, 2017, 4:32pm

Hey folks! Am I correct in assuming that Hail currently doesn’t store phase information when importing genotypes? From the documentation of the Genotype.call() method, it looks like you convert input GT fields into an int that indexes a “lower triangular” matrix representation of genotypes, so a “0|1” genotype is effectively not distinguished from a “1|0” genotype. If so, is supporting phase information somewhere on your future roadmap?
Thanks!
Guillermo
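
To make the “lower triangular” point concrete, here is a minimal Python sketch. It assumes the ordering the VCF specification uses for unphased diploid genotypes, index = k(k+1)/2 + j with j <= k. The function name is hypothetical and this is not Hail's own code; it only illustrates why such an index cannot carry phase.

```python
def unphased_gt_index(a: int, b: int) -> int:
    """Triangular index of an unphased diploid genotype (hypothetical helper).

    The two allele indices are sorted first, so the order in which they
    were written is lost before the index is computed.
    """
    j, k = sorted((a, b))
    return k * (k + 1) // 2 + j


# "0|1" and "1|0" differ only in allele order, so they share one index.
het_a = unphased_gt_index(0, 1)
het_b = unphased_gt_index(1, 0)
```

Both calls return the same value, which is the loss of phase described in the question.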

danking · October 24, 2017, 6:41pm

Supporting phase information is indeed on the roadmap. In the 0.2 release of Hail, we won’t be tied to the GATK-style genotype fields. However, to work well with phased information, methods will need to actually use the phase. Maybe @tpoterba or @jigold can comment more on how that might look. I imagine we’d represent the phased genotypes as an index to the complete matrix of genotypes rather than the lower triangle.
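
A rough sketch of what indexing “the complete matrix of genotypes” could look like, in plain Python. The row-major layout and both function names are assumptions made for illustration; the post does not say how Hail would actually encode phased calls.

```python
from typing import Tuple


def phased_gt_index(j: int, k: int, n_alleles: int) -> int:
    """Row-major index of the ordered allele pair (j, k) in an
    n_alleles x n_alleles matrix (hypothetical encoding)."""
    if not (0 <= j < n_alleles and 0 <= k < n_alleles):
        raise ValueError("allele index out of range")
    return j * n_alleles + k


def phased_gt_alleles(index: int, n_alleles: int) -> Tuple[int, int]:
    """Invert phased_gt_index, recovering the ordered pair (j, k)."""
    if not 0 <= index < n_alleles * n_alleles:
        raise ValueError("genotype index out of range")
    return divmod(index, n_alleles)


# With two alleles, "0|1" and "1|0" now map to different indices.
idx_01 = phased_gt_index(0, 1, n_alleles=2)
idx_10 = phased_gt_index(1, 0, n_alleles=2)
```

Because every ordered pair gets its own cell, the allele order survives a round trip through the index.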

pyazdi · September 17, 2020, 11:13pm

Two questions. Can we use Hail for phased genotypes now? Also, does anyone know whether the gnomAD v3 dataset is phased? Thank you.

Puya

tpoterba · September 18, 2020, 3:53am

Hail can represent phased calls, but what do you want to do with them in particular?

Also, gnomAD doesn’t release individual-level genotype information, only summary statistics.

pyazdi · September 18, 2020, 9:33pm

Ahh, thanks. I didn’t know it was just summary statistics. I was hoping to take the gnomAD data and phase it before using it for imputation, but it looks like I will have to just use the 1000 Genomes data. Such a shame you can’t get individual-level data anywhere else to improve imputation accuracy.

Also, a quick Hail question. Can you break datasets up? PLINK’s memory footprint is currently massive for me, so I am wondering whether Hail can be used for things like this. Sorry for all the questions, but I am a physician-scientist and not a bioinformatician. I appreciate all the help.

tpoterba · September 19, 2020, 1:50am

Yeah, I think other public datasets would help a lot, but I don’t know of much else.

Hail is built to process data out-of-core in order to scale, and doesn’t localize an entire dataset in memory the way PLINK does. Hail could analyze a 500GB dataset on a laptop, though the runtime would still be higher than if you used a cluster.
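
As a loose illustration of the out-of-core idea, the following plain-Python sketch streams a genotype file one variant at a time, so memory use depends on the width of a row and not on the size of the file. This is not how Hail is implemented; the whitespace-separated file layout, the function name, and the toy data are all assumptions made for the example.

```python
import io
from typing import Iterable, Iterator, Tuple


def alt_allele_counts(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (variant_id, alternate allele count) one variant at a time.

    Only the current line is held in memory; nothing is accumulated.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        variant_id, *genotypes = line.split()
        count = 0
        for gt in genotypes:
            # Accept both unphased ("0/1") and phased ("0|1") separators.
            alleles = gt.replace("|", "/").split("/")
            # Anything other than reference ("0") or missing (".") is alternate.
            count += sum(1 for a in alleles if a not in ("0", "."))
        yield variant_id, count


# Hypothetical toy input; a real file object from open(path) iterates the same way.
toy = io.StringIO("#id s1 s2 s3\nrs1 0/1 1|1 0/0\nrs2 0|0 ./. 1|0\n")
results = list(alt_allele_counts(toy))
```

The same generator works unchanged on a file far larger than available memory, since lines are read lazily.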
