Hi Hail fellows,
How can I iterate over rows of a hail table or matrix table?
Thanks!
Hi,
What do you mean by “iterate”? Hail tables and matrix tables don’t provide an interface that allows you to write a Python while loop over the rows / entries. Instead, there are various functions depending on what you want to do (like aggregate and annotate on tables).
The Hail cheat sheets summarize many of these functions: https://hail.is/docs/0.2/cheatsheets.html
I’d look at the table one first, as it’s simpler / more complete.
And if your table is small you can do:
    for row in table.collect():
        ...
but beware this requires the whole table to fit in memory.
Thanks for the response.
I want to calculate a SNP by SNP (row by row) correlation coefficient between 2 datasets. That is, I want to apply some correlation function (e.g. Pearson’s) to the vector of genotype dosages for a SNP in hail table 1 and the vector for the same SNP in ht2, and generate a SNP genotype correlation matrix.
I know annotate, but I was not able to apply a function that takes 2 vectors from 2 different tables. I also found a built-in correlation function, but it only takes rows from one ht and is more of an LD calculation function.
Thanks,
Or
A few questions:
.describe() on both tables/matrix tables?
Assuming your data is in MatrixTable format and the two datasets have the same SNPs and same sample ids:
In [8]: import hail as hl
   ...: mt = hl.balding_nichols_model(3, 1000, 1000)
   ...: mt2 = hl.balding_nichols_model(3, 1000, 1000)
   ...: mt = mt.annotate_entries(n_alt1 = mt.GT.n_alt_alleles())
   ...: mt = mt.annotate_entries(n_alt2 = mt2[mt.row_key, mt.col_key].GT.n_alt_alleles())
   ...: mt = mt.annotate_rows(
   ...:     stats1 = hl.agg.stats(mt.n_alt1),
   ...:     stats2 = hl.agg.stats(mt.n_alt2)
   ...: )
   ...: mt = mt.annotate_rows(
   ...:     pearson_correlation_coefficient =
   ...:         hl.agg.sum((mt.n_alt1 - mt.stats1.mean) / mt.stats1.stdev * (mt.n_alt2 - mt.stats2.mean) / mt.stats2.stdev)
   ...: )
   ...: mt.pearson_correlation_coefficient.show()
2020-02-08 12:55:38 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 1000 samples, and 1000 variants...
2020-02-08 12:55:38 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 1000 samples, and 1000 variants...
2020-02-08 12:55:39 Hail: INFO: Coerced sorted dataset
2020-02-08 12:55:39 Hail: INFO: Coerced sorted dataset
+---------------+------------+---------------------------------+
| locus | alleles | pearson_correlation_coefficient |
+---------------+------------+---------------------------------+
| locus<GRCh37> | array<str> | float64 |
+---------------+------------+---------------------------------+
| 1:1 | ["A","C"] | 1.06e+00 |
| 1:2 | ["A","C"] | 6.57e+00 |
| 1:3 | ["A","C"] | 5.16e+00 |
| 1:4 | ["A","C"] | 1.26e+01 |
| 1:5 | ["A","C"] | 3.66e+01 |
| 1:6 | ["A","C"] | 2.45e+01 |
| 1:7 | ["A","C"] | -3.04e+00 |
| 1:8 | ["A","C"] | -1.23e+01 |
| 1:9 | ["A","C"] | -3.05e+00 |
| 1:10 | ["A","C"] | -3.40e+01 |
| 1:11 | ["A","C"] | 1.07e+01 |
| 1:12 | ["A","C"] | -4.06e+00 |
| 1:13 | ["A","C"] | 3.46e+01 |
| 1:14 | ["A","C"] | 9.78e+00 |
| 1:15 | ["A","C"] | 1.98e+01 |
| 1:16 | ["A","C"] | -1.32e+01 |
| 1:17 | ["A","C"] | -1.64e+01 |
| 1:18 | ["A","C"] | -2.62e+01 |
| 1:19 | ["A","C"] | -5.15e+01 |
| 1:20 | ["A","C"] | -3.47e+00 |
| 1:21 | ["A","C"] | -2.94e+01 |
+---------------+------------+---------------------------------+
showing top 21 rows
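A note on scale: the values printed above are sums of z-score products over samples, so they are not confined to [-1, 1]. Assuming the stdev used for the z-scores is the population standard deviation, dividing that sum by the number of samples (or taking a mean instead of a sum) should give Pearson's r. The plain-Python sketch below illustrates the relationship on toy vectors; the function name and the data are made up for illustration and are not part of Hail.

```python
import math

def z_score_sum_and_pearson(xs, ys):
    """Return (sum of z-score products, Pearson r) for two equal-length vectors.

    Uses the population standard deviation (denominator n). Under that
    assumption the sum of z-score products equals n * r.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    z_sum = sum(
        (x - mean_x) / sd_x * (y - mean_y) / sd_y
        for x, y in zip(xs, ys)
    )
    return z_sum, z_sum / n

# Toy alternate-allele counts for one SNP in two datasets (illustrative only).
n_alt1 = [0, 1, 2, 1, 0, 2]
n_alt2 = [0, 1, 2, 2, 0, 1]
z_sum, r = z_score_sum_and_pearson(n_alt1, n_alt2)
print(z_sum, r)
```

Hail also has an hl.agg.corr aggregator that may cover this directly; it is worth checking the aggregator docs for your version before rolling your own.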
You could perform a similar calculation if you had tables and arrays of calls, using hl.agg.array_agg to apply aggregators to arrays.
What we do above is this:
We create two entry fields, n_alt1 and n_alt2, which represent the number of alternate alleles (thus converting a genotype call to a number) in mt and mt2, respectively.
Thanks Dan, I’ll try it.
My data is in a hail table, and the SNPs are not all the same (so I need to filter them to keep the inner join and sort the dataset).
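The logic described here (keep only SNPs present in both datasets, sort them, then correlate per SNP) can be sketched in plain Python. This is only an illustration of the steps, not Hail code: the dictionaries, keys, and function names are hypothetical, and it assumes each SNP maps to a dosage vector with samples in the same order in both datasets.

```python
import math

def pearson(xs, ys):
    """Pearson r using population standard deviations; None if undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    if var_x == 0 or var_y == 0:
        return None  # a constant vector has no defined correlation
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return cov / math.sqrt(var_x * var_y)

def per_snp_correlation(dosages1, dosages2):
    """Inner-join two {snp_key: dosage_vector} dicts and correlate per SNP."""
    shared = sorted(set(dosages1) & set(dosages2))  # inner join, then sort
    return {snp: pearson(dosages1[snp], dosages2[snp]) for snp in shared}

# Hypothetical data keyed by (locus, ref, alt); sample order must match.
ht1 = {
    ("1:1", "A", "C"): [0, 1, 2, 1, 0, 2],
    ("1:2", "A", "C"): [2, 1, 0, 1, 2, 0],
    ("1:3", "A", "C"): [0, 0, 1, 1, 2, 2],
}
ht2 = {
    ("1:2", "A", "C"): [2, 1, 0, 1, 2, 0],
    ("1:1", "A", "C"): [0, 1, 2, 2, 0, 1],
    ("1:4", "A", "C"): [1, 1, 1, 0, 0, 2],
}
result = per_snp_correlation(ht1, ht2)
for snp, r in result.items():
    print(snp, r)
```

SNPs found in only one dataset are dropped by the set intersection, which plays the role of the inner join.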