Hi Hail fellows,

How can I iterate over rows of a hail table or matrix table?

Thanks!

Hi Hail fellows,

How can I iterate over rows of a hail table or matrix table?

Thanks!

Hi,

What do you mean by “iterate”? Hail tables and matrix tables don’t provide an interface that allows you to write a python while loop over the rows / entries. Instead, there are various functions depending on what you want to do (like the `aggregate`

and `annotate`

on tables)

The Hail cheat sheets summarize many of these functions: https://hail.is/docs/0.2/cheatsheets.html

I’d look at the table one first, as it’s simpler / more complete.

And if your table is small you can do:

```
for row in table.collect():
....
```

but beware this requires the whole table to fit in memory.

Thanks for the response.

I want to calculate a SNP by SNP (row by row) correlation coefficient between 2 datasets. Meaning applying some correlation function (e.g. Pearson’s) to the vector of genotype dosage from hail table 1 and the vector of the same SNP from ht2, and generate a SNP genotype correlation matrix.

I know annotate, but could was not able to apply a function that will take 2 vectors from 2 different tables. I also found a build in correlation function but it also only takes the rows from only one ht, and is more of a LD calculation function.

Thanks,

Or

A few questions:

- is your data stored in Table or MatrixTable format?
- can you share
`.describe()`

on both tables/matrix tables?

Assuming your data is in MatrixTable format and the two datasets have the same SNPs and same sample ids:

```
In [8]: import hail as hl
...: mt = hl.balding_nichols_model(3, 1000, 1000)
...: mt2 = hl.balding_nichols_model(3, 1000, 1000)
...: mt = mt.annotate_entries(n_alt1 = mt.GT.n_alt_alleles())
...: mt = mt.annotate_entries(n_alt2 = mt2[mt.row_key, mt.col_key].GT.n_alt_alleles())
...: mt = mt.annotate_rows(
...: stats1 = hl.agg.stats(mt.n_alt1),
...: stats2 = hl.agg.stats(mt.n_alt2)
...: )
...: mt = mt.annotate_rows(
...: pearson_correlation_coefficient =
...: hl.agg.sum((mt.n_alt1 - mt.stats1.mean) / mt.stats1.stdev * (mt.n_alt2 - mt.stats2.mean) / mt.stats2.stdev)
...: )
...: mt.pearson_correlation_coefficient.show()
2020-02-08 12:55:38 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 1000 samples, and 1000 variants...
2020-02-08 12:55:38 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 1000 samples, and 1000 variants...
2020-02-08 12:55:39 Hail: INFO: Coerced sorted dataset
2020-02-08 12:55:39 Hail: INFO: Coerced sorted dataset
+---------------+------------+---------------------------------+
| locus | alleles | pearson_correlation_coefficient |
+---------------+------------+---------------------------------+
| locus<GRCh37> | array<str> | float64 |
+---------------+------------+---------------------------------+
| 1:1 | ["A","C"] | 1.06e+00 |
| 1:2 | ["A","C"] | 6.57e+00 |
| 1:3 | ["A","C"] | 5.16e+00 |
| 1:4 | ["A","C"] | 1.26e+01 |
| 1:5 | ["A","C"] | 3.66e+01 |
| 1:6 | ["A","C"] | 2.45e+01 |
| 1:7 | ["A","C"] | -3.04e+00 |
| 1:8 | ["A","C"] | -1.23e+01 |
| 1:9 | ["A","C"] | -3.05e+00 |
| 1:10 | ["A","C"] | -3.40e+01 |
| 1:11 | ["A","C"] | 1.07e+01 |
| 1:12 | ["A","C"] | -4.06e+00 |
| 1:13 | ["A","C"] | 3.46e+01 |
| 1:14 | ["A","C"] | 9.78e+00 |
| 1:15 | ["A","C"] | 1.98e+01 |
| 1:16 | ["A","C"] | -1.32e+01 |
| 1:17 | ["A","C"] | -1.64e+01 |
| 1:18 | ["A","C"] | -2.62e+01 |
| 1:19 | ["A","C"] | -5.15e+01 |
| 1:20 | ["A","C"] | -3.47e+00 |
| 1:21 | ["A","C"] | -2.94e+01 |
+---------------+------------+---------------------------------+
showing top 21 rows
```

You could perform a similar calculation if you had tables and `array`

s of calls using `hl.agg.array_agg`

to apply aggregators to arrays.

What we do above is this:

- generate two random datasets
- create entry fields
`n_alt1`

and`n_alt2`

which represent the number of alternate alleles (thus convergent a genotype call to a number) in`mt`

and`mt2`

, respectively. - compute the mean and standard deviation of the aforementioned fields
- compute the Pearson correlation coefficient by taking the sum of the products of the mean-centered and variance normalized alternate-allele-counts.

Thanks Dan, i’ll try it.

My data is in a hail table, and the SNPs are not all the same (so I need to filter them to keep the inner join and sort the dataset).