Iterate over rows

Hi Hail fellows,
How can I iterate over the rows of a Hail table or matrix table?

Thanks!

Hi,

What do you mean by “iterate”? Hail tables and matrix tables don’t provide an interface that lets you write a Python loop over the rows / entries. Instead, there are various functions depending on what you want to do (such as aggregate and annotate on tables).

The Hail cheat sheets summarize many of these functions: https://hail.is/docs/0.2/cheatsheets.html

I’d look at the table one first, as it’s simpler / more complete.

And if your table is small you can do:

for row in table.collect():
    ....

but beware: collect() brings every row back to your local Python session, so the whole table has to fit in memory.

Thanks for the response.
I want to calculate a SNP-by-SNP (row-by-row) correlation coefficient between two datasets. That is, I want to apply a correlation function (e.g. Pearson’s) to the vector of genotype dosages for a SNP in Hail table 1 and the vector for the same SNP in ht2, and generate a SNP genotype correlation matrix.

I know annotate, but I was not able to apply a function that takes two vectors from two different tables. I also found a built-in correlation function, but it only takes rows from a single ht and is more of an LD calculation function.

Thanks,
Or

A few questions:

  • is your data stored in Table or MatrixTable format?
  • can you share .describe() on both tables/matrix tables?

Assuming your data is in MatrixTable format and the two datasets have the same SNPs and same sample ids:

In [8]: import hail as hl 
   ...: mt = hl.balding_nichols_model(3, 1000, 1000) 
   ...: mt2 = hl.balding_nichols_model(3, 1000, 1000) 
   ...: mt = mt.annotate_entries(n_alt1 = mt.GT.n_alt_alleles()) 
   ...: mt = mt.annotate_entries(n_alt2 = mt2[mt.row_key, mt.col_key].GT.n_alt_alleles()) 
   ...: mt = mt.annotate_rows( 
   ...:     stats1 = hl.agg.stats(mt.n_alt1), 
   ...:     stats2 = hl.agg.stats(mt.n_alt2) 
   ...: ) 
   ...: mt = mt.annotate_rows( 
   ...:     pearson_correlation_coefficient =  
   ...:         hl.agg.sum((mt.n_alt1 - mt.stats1.mean) / mt.stats1.stdev * (mt.n_alt2 - mt.stats2.mean) / mt.stats2.stdev) 
   ...: ) 
   ...: mt.pearson_correlation_coefficient.show()                                                                                              
2020-02-08 12:55:38 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 1000 samples, and 1000 variants...
2020-02-08 12:55:38 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 1000 samples, and 1000 variants...
2020-02-08 12:55:39 Hail: INFO: Coerced sorted dataset
2020-02-08 12:55:39 Hail: INFO: Coerced sorted dataset
+---------------+------------+---------------------------------+
| locus         | alleles    | pearson_correlation_coefficient |
+---------------+------------+---------------------------------+
| locus<GRCh37> | array<str> |                         float64 |
+---------------+------------+---------------------------------+
| 1:1           | ["A","C"]  |                        1.06e+00 |
| 1:2           | ["A","C"]  |                        6.57e+00 |
| 1:3           | ["A","C"]  |                        5.16e+00 |
| 1:4           | ["A","C"]  |                        1.26e+01 |
| 1:5           | ["A","C"]  |                        3.66e+01 |
| 1:6           | ["A","C"]  |                        2.45e+01 |
| 1:7           | ["A","C"]  |                       -3.04e+00 |
| 1:8           | ["A","C"]  |                       -1.23e+01 |
| 1:9           | ["A","C"]  |                       -3.05e+00 |
| 1:10          | ["A","C"]  |                       -3.40e+01 |
| 1:11          | ["A","C"]  |                        1.07e+01 |
| 1:12          | ["A","C"]  |                       -4.06e+00 |
| 1:13          | ["A","C"]  |                        3.46e+01 |
| 1:14          | ["A","C"]  |                        9.78e+00 |
| 1:15          | ["A","C"]  |                        1.98e+01 |
| 1:16          | ["A","C"]  |                       -1.32e+01 |
| 1:17          | ["A","C"]  |                       -1.64e+01 |
| 1:18          | ["A","C"]  |                       -2.62e+01 |
| 1:19          | ["A","C"]  |                       -5.15e+01 |
| 1:20          | ["A","C"]  |                       -3.47e+00 |
| 1:21          | ["A","C"]  |                       -2.94e+01 |
+---------------+------------+---------------------------------+
showing top 21 rows

You could perform a similar calculation if you had tables and arrays of calls using hl.agg.array_agg to apply aggregators to arrays.

What we do above is this:

  • generate two random datasets
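  • create entry fields n_alt1 and n_alt2, which represent the number of alternate alleles (thus converting a genotype call to a number) in mt and mt2, respectively.
  • compute, for each row (SNP), the mean and standard deviation of those fields across samples.
  • compute, for each row, the sum over samples of the products of the mean-centered, standard-deviation-normalized alternate-allele counts. Note that this sum is not yet the Pearson correlation coefficient: it still has to be divided by the number of samples (assuming stdev is the population standard deviation), which is why the values shown above fall outside [-1, 1].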
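
To see how these steps fit together numerically, here is a minimal plain-Python sketch (not Hail code) of the same calculation for a single SNP. The function name and the example allele counts are made up for illustration. It uses population standard deviations, so the sum of products must be divided by the number of samples to land in [-1, 1]; the raw sum alone is n times the coefficient.

```python
import math

def pearson_from_z_scores(x, y):
    """Pearson r as the mean of the products of standardized values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population standard deviations (divide by n, not n - 1).
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    # Sum of the products of the centered, scaled values ...
    total = sum((a - mean_x) / sd_x * (b - mean_y) / sd_y
                for a, b in zip(x, y))
    # ... divided by n gives the correlation coefficient.
    return total / n

# Hypothetical alternate-allele counts for one SNP in two datasets.
n_alt1 = [0, 1, 2, 1, 0, 2]
n_alt2 = [0, 1, 2, 2, 0, 1]
r = pearson_from_z_scores(n_alt1, n_alt2)  # about 0.75 for this data
```

In the Hail snippet, the equivalent change would presumably be to use hl.agg.mean in place of hl.agg.sum, assuming hl.agg.stats reports a population standard deviation. Hail's aggregator documentation also lists hl.agg.corr, which may be worth checking for your version.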

Thanks Dan, I’ll try it.
My data is in a Hail table, and the SNPs are not all the same in both datasets, so I need to keep only the SNPs present in both (an inner join) and sort the dataset.
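
The inner-join step described here can be illustrated with a small plain-Python sketch. This is not Hail API: the dict-of-lists layout, the SNP identifiers, and the function names are all hypothetical, and it assumes both datasets list samples in the same order. It keeps only SNPs present in both datasets, sorts them, and computes one correlation per shared SNP.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # undefined when a SNP shows no variation
    return cov / math.sqrt(var_x * var_y)

def per_snp_correlation(dosages1, dosages2):
    """Inner-join two {snp_id: dosage vector} dicts and correlate each SNP."""
    # Set intersection of the keys is the inner join. Sorting is by string
    # here; genomic order would need the IDs parsed into (contig, position).
    shared = sorted(dosages1.keys() & dosages2.keys())
    return {snp: pearson(dosages1[snp], dosages2[snp]) for snp in shared}

# Hypothetical dosage vectors keyed by a made-up SNP identifier.
ds1 = {
    "1:100:A:C": [0.0, 1.0, 2.0, 1.0],
    "1:200:G:T": [2.0, 2.0, 0.0, 1.0],
    "1:300:C:T": [0.0, 0.0, 1.0, 0.0],  # only in dataset 1, dropped
}
ds2 = {
    "1:200:G:T": [2.0, 1.0, 0.0, 1.0],
    "1:100:A:C": [0.0, 1.0, 2.0, 1.0],
    "1:400:T:G": [1.0, 1.0, 2.0, 0.0],  # only in dataset 2, dropped
}
result = per_snp_correlation(ds1, ds2)
```

For real data the join and the per-SNP arithmetic would need to run inside Hail rather than on collected Python objects, since this sketch holds everything in local memory.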