Applying gnomAD Ancestry Methods to other Data

tobyrmanders · May 28, 2021, 6:47pm

I’m brand new to Hail, so please bear with my ignorance as I get up to speed.

I see that ancestry inference for gnomAD is performed with Hail – and I’m interested in applying a similar semisupervised approach to a non-gnomAD dataset.

I understand the gnomAD team is using a set of ~100k high-quality variant sites and ~16k group labels to do this inference.

1. Is there an easy way for me to obtain these sites and labels?
2. IIRC, these sites have only one credible variant/SNP each. Are they being represented prior to PCA as a binary table?
3. Because many of our samples will not include all the included sites, I expect I will need to reduce and retrain with subsets of these sites. To do that I’ll need to prepare the full table mentioned above that is being used in gnomAD. Is there a straightforward way to accomplish this?

Thanks for all your help. Looking forward to learning the ropes of Hail.

tobyrmanders · August 2, 2021, 9:26pm

Bumping this in the hopes that someone can help out. Thanks in advance.

danking · August 2, 2021, 10:21pm

Hi @tobyrmanders,

The gnomAD team has some commentary on how they do ancestry. I suspect they cannot share their set of population-labeled genomes because they can’t share the genomes themselves. That said, gnomAD v3 includes a public dataset of the HGDP and 1000 Genomes samples. I know population labels exist for the 1000 Genomes samples, and I suspect HGDP also has publicly available population labels.

For question two, gnomAD “splits” multi-allelic sites into two or more biallelic variants. The data is represented as a Hail MatrixTable of biallelic genotype calls which are interpreted as 0, 1, or 2 (the number of alternate alleles).
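To make the 0/1/2 encoding concrete, here is a minimal plain-Python sketch of what splitting does to a single genotype call. It is only an illustration, not Hail's or gnomAD's implementation (in Hail the splitting itself is done with functions such as `hl.split_multi_hts`). The helper name `split_alt_counts` and the toy calls are hypothetical, and I'm assuming that after splitting, each alternate allele gets its own row in which every other allele is treated as reference.

```python
def split_alt_counts(call, n_alt_alleles):
    """Return one 0/1/2 value per alternate allele for a single genotype call.

    `call` is a tuple of allele indices (0 = reference, 1..n = alternate
    alleles), or None for a missing call. Each returned entry corresponds to
    one biallelic row produced by splitting the multi-allelic site.
    """
    if call is None:
        # A missing call stays missing in every split row.
        return [None] * n_alt_alleles
    return [
        sum(1 for allele in call if allele == alt)
        for alt in range(1, n_alt_alleles + 1)
    ]


# Hypothetical calls at a site with two alternate alleles.
calls = {"s1": (0, 0), "s2": (0, 1), "s3": (1, 2), "s4": (2, 2), "s5": None}

for sample, call in calls.items():
    print(sample, split_alt_counts(call, n_alt_alleles=2))
```

So the matrix that goes into PCA is not strictly binary: a homozygous alternate call contributes a 2, and missing calls need to be handled separately.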

I’m not sure I fully understand question three. Can you ask it again in light of the above?
