Hi,
After importing a gen how do I write genotype dosages to a file or pandas dataframe? I found the dosage method mentioned in the documentation, but I still don’t know how to go about doing this.
Can you be a bit more specific about what you want to do with the dosages? Moving data from Hail / Spark into a pandas dataframe is very inefficient.
Yes, I see. I would like to compare (e.g. by correlation) two sets of dosages with each another.
Got it. This isn’t something you can do in 0.1 (definitely will be possible in the next version).
I think make_table can make the text file you want:
# vds is the dataset from the imported .gen
table = vds.make_table('v = v', ['`` = g.dosage']) # make a table with one col per sample
table.export('file.tsv')
Then you can read it in with pandas / anything else.
Excellent, thank you
I noticed that the genotype to dosage conversion is erroneous in some cases. For example, a genotype listed in the gen file:
22 --- rs16981741 17309881 A G 0.999982 1.78e-05 0
is converted to the following dosage:
chr pos ref altAlleles 1
0 22 17309881 A [(A, G)] 0.000031
Hail’s numerical precision for genotype probabilities in 0.1 is 1 / 2**15
. This value is getting rounded to the smallest possible number:
In [22]: 1.0 / (2 ** 15)
Out[22]: 3.0517578125e-05