Applying gnomAD Ancestry Methods to Other Data

I’m brand new to Hail, so please bear with my ignorance as I get up to speed.

I see that ancestry inference for gnomAD is performed with Hail – and I’m interested in applying a similar semisupervised approach to a non-gnomAD dataset.

I understand the gnomAD team is using a set of ~100k high-quality variant sites and ~16k group labels to do this inference.

  1. Is there an easy way for me to obtain these sites and labels?
  2. IIRC, these sites have only one credible variant/SNP each. Are they being represented prior to PCA as a binary table?
  3. Because many of our samples will not cover all of these sites, I expect I will need to reduce the site list and retrain on subsets of it. To do that I’ll need to prepare the full table mentioned above that is being used in gnomAD. Is there a straightforward way to accomplish this?

Thanks for all your help. Looking forward to learning the ropes of Hail.

Bumping this in the hopes that someone can help out. Thanks in advance.

Hi @tobyrmanders,

The gnomAD team has some commentary on how they do ancestry. I suspect they cannot share their set of population-labeled genomes because they can’t share the underlying genomes. That said, gnomAD v3 includes a public dataset of the HGDP and 1000 Genomes samples. I know population labels exist for 1000 Genomes, and I suspect HGDP also has publicly available population labels.

For question two, gnomAD “splits” multi-allelic sites into two or more biallelic variants. The data is represented as a Hail MatrixTable of biallelic genotype calls which are interpreted as 0, 1, or 2 (the number of alternate alleles).
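To make the 0/1/2 encoding concrete, here is a minimal plain-Python sketch of splitting a multi-allelic site and counting alternate alleles per resulting biallelic variant. The function name `split_multiallelic` and the tuple layout are made up for illustration; this is not gnomAD's code. In Hail itself the corresponding steps would be `hl.split_multi_hts` followed by `GT.n_alt_alleles()` on the call field.

```python
from typing import List, Optional, Tuple

def split_multiallelic(
    alleles: List[str],
    genotype: Optional[Tuple[int, int]],
) -> List[Tuple[str, str, Optional[int]]]:
    """Hypothetical helper: split one site into biallelic rows.

    alleles  -- [ref, alt1, alt2, ...]
    genotype -- pair of allele indices into `alleles`, or None if missing
    Returns one (ref, alt, n_alt) row per alternate allele, where n_alt
    is 0, 1, or 2 (the number of copies of that alt), or None if missing.
    """
    ref = alleles[0]
    rows = []
    for index, alt in enumerate(alleles[1:], start=1):
        if genotype is None:
            n_alt = None  # missing call stays missing
        else:
            # Count how many of the two alleles match this alt.
            n_alt = sum(1 for allele in genotype if allele == index)
        rows.append((ref, alt, n_alt))
    return rows

# A sample that is C/T at a site with alleles A (ref), C, T
# contributes one copy to each of the two split variants.
print(split_multiallelic(["A", "C", "T"], (1, 2)))
```

So the matrix that goes into PCA is not strictly binary: each entry is an alternate-allele count, and missing calls need to be handled separately.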

I’m not sure I fully understand question three. Can you ask it again in light of the above?