I’m brand new to Hail, so please bear with my ignorance as I get up to speed.
I see that ancestry inference for gnomAD is performed with Hail, and I’m interested in applying a similar semi-supervised approach to a non-gnomAD dataset.
I understand the gnomAD team uses a set of ~100k high-quality variant sites and ~16k group labels to do this inference.
- Is there an easy way for me to obtain these sites and labels?
- IIRC, each of these sites has only one credible variant/SNP. Are they represented as a binary table before PCA?
- Because many of our samples will not cover all of these sites, I expect I will need to reduce the site list and retrain on subsets of it. To do that, I’ll need to build the full sample-by-site table that gnomAD uses as PCA input. Is there a straightforward way to accomplish this?
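
To make the last question concrete, here is a rough sketch of the subsetting step I have in mind, written in plain Python rather than Hail. Everything in it is hypothetical: the site keys, the genotype values, and the helper names are made up for illustration, and the 0/1/2 alternate-allele-count encoding is only my guess at what the PCA input might look like.

```python
# Hypothetical toy data: site keys and genotype values are made up.
reference_sites = ["1:1000:A:G", "1:2000:C:T", "2:500:G:A", "3:750:T:C"]

# Rows are labeled reference samples; columns follow reference_sites.
# Values are alternate-allele counts (0, 1, or 2). This encoding is an
# assumption on my part, not something I know gnomAD uses.
reference_genotypes = [
    [0, 1, 2, 0],
    [1, 1, 0, 2],
    [2, 0, 1, 1],
]

# Sites covered in my own dataset: some reference sites plus one extra.
my_sites = {"1:1000:A:G", "2:500:G:A", "3:750:T:C", "4:100:A:T"}


def shared_site_indices(all_sites, available_sites):
    """Return the column indices of all_sites that also appear in available_sites."""
    return [i for i, site in enumerate(all_sites) if site in available_sites]


def subset_columns(matrix, column_indices):
    """Keep only the given columns of a row-major matrix."""
    return [[row[i] for i in column_indices] for row in matrix]


keep = shared_site_indices(reference_sites, my_sites)
reduced_sites = [reference_sites[i] for i in keep]
reduced_genotypes = subset_columns(reference_genotypes, keep)
```

The reduced matrix is what I would then feed into PCA and use to retrain the classifier on the labeled samples. I assume Hail has a more direct way to do this kind of site filtering, which is really what I am asking about.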
Thanks for all your help. Looking forward to learning the ropes of Hail.