Apologies in advance for misusing terminology; I am rather new to genetics work. My ultimate goal is to generate intermediate data so that I can perform my own GWAS with a prototype niche system I am developing that uses Hail for filtering and preprocessing. Previously I used the 1kg sample data from the GWAS tutorial and the annotations (isFemale and PurpleHair) for logistic regression with one covariate. This worked well for development purposes, but now I want to use a lot more data to test my GWAS system. In the schema for your datasets it looks like only the allele information is available, and I could not find where the matching annotations/variant data might be.
My question is: where can I find the annotation/variant data and how do I load this data and merge it with the allele data?
Hopefully this makes sense, please let me know if I am misunderstanding something and the answer is right under my nose!
What kind of variant annotations do you need? The Hail team maintains a pretty extensive set of variant annotations which you can join/annotate onto your genetic data. If you need something else, you’d want to import it as a Hail Table and then use annotate_rows to add it to your genetic data:
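import hail as hl
t = hl.import_table(...)  # t must be keyed by locus and alleles so it can be joined on variants
# index the table by the matrix table's row key to look up the matching row
mt = mt.annotate_rows(new_variant_annotation = t[mt.locus, mt.alleles].the_column_of_interest)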
I am looking for three variants: one for logistic regression, one for linear regression, and one random variant I can use as a covariate (such as gender). I am a little confused because I do not see variants for any of the 1kg datasets?
Hmm. I think we have a terminology misunderstanding here.
In Hail’s genetics vocabulary (which sometimes differs from mainstream genetics), the term “variant” refers to a position on the genome, the reference allele at that position, and one or more alternate alleles observed at that position. Variants are often written like this:
1:1000:A:T
This refers to chromosome 1, 1000 base pairs from the beginning, reference allele A and alternate allele T.
In Hail, a variant is usually identified by two “row fields”: “locus” and “alleles”. Sometimes the entire row is referred to as “the variant”. Sometimes just the identifying data are.
In Hail’s genetics vocabulary, a “sample” refers to a sample of genetic material that has been sequenced and added to our dataset. A sample is usually identified by a string stored in the column field “s”. Often the entire column is referred to as “the sample”.
—
I assume you’re using the full 1kg data from the Datasets API? To my knowledge, that dataset was not collected with phenotypes (i.e. information that varies per sample). Your examples (binarized sex, hair color) are phenotypes (aka sample metadata, aka sample annotations). Those two annotations were created by the Hail team for the tutorial using the statistical distributions in Hail’s random functions. You could also create some random phenotypes for your purposes.
If you need a semi-public dataset with real phenotypes, you might try the UK Biobank genotypes.
—
Does that help?
I completely had the vocabulary mixed up, thank you for clarifying. Yes, I am looking for phenotype information. I will look into the UK Biobank for that, and into the random functions. Thank you for taking the time to help me!