Support for phased genotypes

Hey folks! Am I correct in assuming that Hail currently doesn’t store phase information when importing genotypes? From the documentation of the Genotype.call() method, it looks like you convert input GT fields into an int indexing a “lower triangular” matrix representation of genotypes, which means a “0|1” genotype is not distinguished from a “1|0” genotype. If so, is supporting phase information somewhere on your future roadmap?
Thanks!
Guillermo

Supporting phase information is indeed on the roadmap. In the 0.2 release of Hail, we won’t be tied to the GATK-style genotype fields. However, to work well with phased information, methods will need to actually use the phase. Maybe @tpoterba or @jigold can comment more on how that might look. I imagine we’d represent the phased genotypes as an index to the complete matrix of genotypes rather than the lower triangle.
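
To make the lower-triangle versus full-matrix distinction concrete, here is a minimal sketch in plain Python. It is not Hail's implementation: the function names are made up for illustration, the triangular formula is the usual VCF-style genotype ordering, and the row-major layout of the full matrix is just one assumed way such an index could be defined.

```python
def unphased_index(a: int, b: int) -> int:
    """Index into the lower triangle of the allele-pair matrix.

    The pair is sorted first, so 0|1 and 1|0 collapse to the same index
    and the phase is lost.
    """
    j, k = sorted((a, b))
    return k * (k + 1) // 2 + j


def phased_index(a: int, b: int, n_alleles: int) -> int:
    """Index into the full n_alleles x n_alleles matrix.

    Assumed row-major layout (hypothetical, for illustration only):
    allele order is preserved, so 0|1 and 1|0 get different indices.
    """
    if not (0 <= a < n_alleles and 0 <= b < n_alleles):
        raise ValueError("allele index out of range")
    return a * n_alleles + b


# Biallelic site: two alleles, 0 (ref) and 1 (alt).
same_unphased = unphased_index(0, 1) == unphased_index(1, 0)
same_phased = phased_index(0, 1, 2) == phased_index(1, 0, 2)
print(same_unphased, same_phased)
```

With k alleles the triangular scheme needs k(k+1)/2 indices and the full matrix needs k², which is the cost of keeping the order of the two alleles.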

Two questions. Can we use Hail now for phased genotypes? Also, does anyone know whether the gnomAD v3 dataset is phased? Thank you.

Puya

Hail can represent phased calls, but what do you want to do with them in particular?

Also, gnomAD doesn’t release individual-level genotype information, only summary statistics.

Ahh, thanks. I didn’t know it was just summary statistics. I was hoping to take the gnomAD data and phase it before using it for imputation, but it looks like I will have to just use the 1000 Genomes data. Such a shame you can’t get individual-level data anywhere else to improve imputation accuracy.

Also a quick Hail question. Can you break datasets up? Currently, PLINK’s memory footprint is massive so just wondering if you can use Hail for things like this. Sorry for the questions but I am a physician-scientist and not a bioinformatician. Appreciate all the help.

Yeah, I think other public datasets would help a lot, but I don’t know of much else.

Hail is built to process data out-of-core in order to scale, and doesn’t localize an entire dataset in memory the way PLINK does. Hail could analyze a 500GB dataset on a laptop, though the runtime would still be higher than if you used a cluster.
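
As a rough illustration of what "out-of-core" means here, the sketch below streams a toy genotype table one row at a time, so memory use depends on the width of a row rather than the size of the file. This is plain Python, not Hail code and not how Hail is implemented; the tab-separated layout, the `alt_allele_counts` name, and the toy data are all assumptions made up for the example.

```python
import io
from typing import Iterable, Iterator, Tuple


def alt_allele_counts(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (variant_id, number of non-reference alleles) row by row.

    Only the current line is held in memory; nothing is accumulated
    across rows.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        variant_id, *genotypes = line.split("\t")
        alt = 0
        for gt in genotypes:
            # Treat phased (|) and unphased (/) separators alike here.
            for allele in gt.replace("|", "/").split("/"):
                if allele not in (".", "0"):
                    alt += 1
        yield variant_id, alt


# Toy input standing in for a large file opened with open(path).
toy = "#id\ts1\ts2\nrs1\t0|1\t1/1\nrs2\t0/0\t./.\n"
results = list(alt_allele_counts(io.StringIO(toy)))
print(results)
```

On a real file you would pass the open file handle instead of `io.StringIO(toy)` and consume the generator lazily rather than collecting it into a list.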
