I need to do a roll_sum of the dosages. Is there a way to do this in Hail?

So I have a PLINK file from UKBiobank and I want to extract a matrix A of dosages with each row as a patient and each column as a SNP as:

|          | rs001 | rs002 |
|----------|-------|-------|
| patient1 | A11   | A12   |
| patient2 | A21   | A22   |

where A11 is the dosage of patient1 at SNP rs001. For example, A11 = 0*0.3 + 1*0.2 + 2*0.5 (the expected alternate-allele count, given genotype probabilities 0.3, 0.2, and 0.5).

Then I want to do the following operations:

  1. First, perform a broadcasted (elementwise) multiplication A*x[None,:] with a vector x, producing a matrix of the same size as A. I’m using NumPy broadcasting notation here.

  2. Then, perform a roll_sum (see the R roll_sum function documentation) over the columns to create a new matrix of collapsed SNPs. This returns a matrix with the same number of rows as A, but the number of columns is reduced to the number of columns of A minus the window size (plus one).

  3. Export this matrix to a new binary file, or do a regression analysis with this new matrix.

As mentioned in a later post, this entire operation can be seen as roll_sum(A*x[newaxis,:], axis=1, window_size=10).
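For concreteness, here is a minimal NumPy sketch of this operation on toy data (roll_sum is implemented by hand with a cumulative sum, since NumPy has no built-in rolling sum):

import numpy as np

def roll_sum(M, window_size):
    # Rolling sum of width `window_size` along axis 1 (columns);
    # the result has M.shape[1] - window_size + 1 columns.
    c = np.cumsum(M, axis=1)
    out = c[:, window_size - 1:].copy()
    out[:, 1:] -= c[:, :-window_size]
    return out

A = np.random.rand(4, 12)  # toy dosage matrix: 4 patients x 12 SNPs
x = np.random.rand(12)     # one numeric weight per SNP

result = roll_sum(A * x[np.newaxis, :], window_size=10)
# result has shape (4, 3): 12 - 10 + 1 = 3 collapsed SNP columns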

Is there a tutorial on how I can perform this sequence of operations with Hail?

    Singularity> python
    Python 3.6.9 (default, Nov 23 2019, 07:02:27) 
    [GCC 6.3.0 20170516] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import hail as hl
    >>> ds = hl.import_plink("/scratch/zhupy/ukb_imp_chr1_v3_pruned.bed","/scratch/zhupy/ukb_imp_chr1_v3_pruned.bim","/scratch/zhupy/ukb_imp_chr1_v3_pruned.fam")
    Initializing Hail with default parameters...
    2020-12-02 16:30:31 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    2020-12-02 16:30:33 WARN  Hail:37 - This Hail JAR was compiled for Spark 2.4.5, running with Spark 2.4.1.
      Compatibility is not guaranteed.
    Running on Apache Spark version 2.4.1
    SparkUI available at http://cdr860.int.cedar.computecanada.ca:4040
    Welcome to
         __  __     <>__
        / /_/ /__  __/ /
       / __  / _ `/ / /
      /_/ /_/\_,_/_/_/   version 0.2.60-de1845e1c2f6
    LOGGING: writing to /scratch/zhupy/hail-20201202-0830-0.2.60-de1845e1c2f6.log
    2020-12-02 16:30:48 Hail: INFO: Found 487409 samples in fam file.
    2020-12-02 16:30:48 Hail: INFO: Found 1220764 variants in bim file.
    >>> type(ds)
    <class 'hail.matrixtable.MatrixTable'>

Disclaimer: not a geneticist

I don’t think I understand what you’re asking. Where are the dosages / SNPs coming from in this case? A .fam file is just some sample information, right?

Hi @johnc1231, thanks for your reply. You can basically ignore that information; I am just doing something like roll_sum(A*x[newaxis,:], axis=1, window_size=10) for a genetic matrix A and a numerical vector x. Yes, the .fam file is just sample information.

Hey @garyzhubc,

Transposing the UKB matrix from variant-by-sample to sample-by-variant is expensive. I recommend avoiding that.

I’m not sure I understand how operation 1 and operation 2 are related. Suppose we do not transpose the genetic matrix; call it G. Your first operation,

v = x * G

produces a vector, v, with an entry for each sample. I don’t see how you could perform a rolling sum over that to produce collapsed SNPs.


If you really want a sample vector that is the dot product of the variants with some vector v, load v as a locus-keyed table and:

v = hl.read_table(...)  # a table keyed by locus with one numeric field
mt = mt.annotate_cols(
    sample_vector_value = hl.agg.sum(  # aggregates over all variants, per sample
        mt.GT.n_alt_alleles() * v[mt.locus].the_field))  # 'the_field' is v's numeric field

the_sample_vector_as_a_python_list = mt.sample_vector_value.collect()

As to the more direct question about rolling sums: Hail does not have good support for windowed aggregators. With current Hail technology, this operation requires a “shuffle”, which is expensive and slow. You can try the following and see if it is sufficiently fast for your purposes, replacing the random Balding-Nichols dataset with your UKB matrix.

WINDOW_SIZE = 6
WINDOW_DIFF = WINDOW_SIZE // 2

mt = hl.balding_nichols_model(1, 3, 10)  # toy dataset: 1 population, 3 samples, 10 variants
mt.show()
# For each variant, enumerate the centers of every window that contains it.
mt = mt.annotate_rows(
    my_windows = hl.range(hl.max(1, mt.locus.position - WINDOW_DIFF),
                          mt.locus.position + WINDOW_DIFF))
mt = mt.explode_rows('my_windows')  # one row per (variant, window) pair
mt = mt.annotate_rows(my_window = mt.my_windows + 1)  # shift by one since hl.range is end-exclusive
# Materialize each window as a locus interval so it can serve as a row key.
mt = mt.annotate_rows(
    locus_window = hl.interval(hl.locus(mt.locus.contig, hl.max(1, mt.my_window - WINDOW_DIFF)),
                               hl.locus(mt.locus.contig, mt.my_window + WINDOW_DIFF)))
# Group by window and sum alt-allele counts per sample; this step triggers the shuffle.
mt = mt.group_rows_by(mt.locus_window).aggregate(n_alleles_in_window = hl.agg.sum(mt.GT.n_alt_alleles()))
mt.show()

Output:

+---------------+------------+------+------+------+
| locus         | alleles    | 0.GT | 1.GT | 2.GT |
+---------------+------------+------+------+------+
| locus<GRCh37> | array<str> | call | call | call |
+---------------+------------+------+------+------+
| 1:1           | ["A","C"]  | 0/1  | 0/0  | 0/1  |
| 1:2           | ["A","C"]  | 0/1  | 1/1  | 0/1  |
| 1:3           | ["A","C"]  | 1/1  | 0/1  | 1/1  |
| 1:4           | ["A","C"]  | 0/0  | 0/0  | 0/0  |
| 1:5           | ["A","C"]  | 0/1  | 0/0  | 0/1  |
| 1:6           | ["A","C"]  | 1/1  | 0/1  | 0/1  |
| 1:7           | ["A","C"]  | 0/1  | 1/1  | 0/0  |
| 1:8           | ["A","C"]  | 1/1  | 1/1  | 1/1  |
| 1:9           | ["A","C"]  | 0/0  | 1/1  | 0/1  |
| 1:10          | ["A","C"]  | 0/0  | 0/1  | 0/1  |
+---------------+------------+------+------+------+
+-------------------------+-----------------------+-----------------------+-----------------------+
| locus_window            | 0.n_alleles_in_window | 1.n_alleles_in_window | 2.n_alleles_in_window |
+-------------------------+-----------------------+-----------------------+-----------------------+
| interval<locus<GRCh37>> |                 int64 |                 int64 |                 int64 |
+-------------------------+-----------------------+-----------------------+-----------------------+
| [1:1-1:5)               |                     4 |                     3 |                     4 |
| [1:1-1:6)               |                     5 |                     3 |                     5 |
| [1:1-1:7)               |                     7 |                     4 |                     6 |
| [1:2-1:8)               |                     7 |                     6 |                     5 |
| [1:3-1:9)               |                     8 |                     6 |                     6 |
| [1:4-1:10)              |                     6 |                     7 |                     5 |
| [1:5-1:11)              |                     6 |                     8 |                     6 |
| [1:6-1:12)              |                     5 |                     8 |                     5 |
| [1:7-1:13)              |                     3 |                     7 |                     4 |
| [1:8-1:14)              |                     2 |                     5 |                     4 |
| [1:9-1:15)              |                     0 |                     3 |                     2 |
| [1:10-1:16)             |                     0 |                     1 |                     1 |
+-------------------------+-----------------------+-----------------------+-----------------------+


For regression analysis and exporting, try searching the Hail documentation; we support a number of file formats and simple regression models.
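For instance, continuing the example above, a sketch (where pheno is a hypothetical numeric sample field you would need to supply):

# Export the windowed matrix of sums as a TSV:
mt.n_alleles_in_window.export('windowed_sums.tsv')

# Or regress a phenotype on the windowed sums, window by window:
result = hl.linear_regression_rows(
    y=mt.pheno,                # hypothetical phenotype column field
    x=mt.n_alleles_in_window,  # the windowed sum as the regressor
    covariates=[1.0])          # intercept term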

Hi @danking, sorry, I shouldn’t have written it as a dot product. It should be a broadcasted matrix-vector multiplication producing a new matrix of the same size as the input matrix. I have edited my previous reply.

I think you have all the pieces you need. The rolling sum can be done with the code I shared above.

For the broadcasted Hadamard product, if you have a variant-indexed vector, then load it as a table t and do this:

mt = mt.annotate_entries(new_entry_value = mt.GT.n_alt_alleles() * t[mt.row_key].table_value)

If you have a sample-indexed vector, then:

mt = mt.annotate_entries(new_entry_value = mt.GT.n_alt_alleles() * t[mt.col_key].table_value)

You have to rephrase your operations in a more relational/SQL way. Hail currently lacks good support for the kind of linear algebra you’re trying to do. You might find the Hail Cheatsheets helpful.
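To make that concrete, here is a sketch chaining the Hadamard product into the windowed aggregation from my earlier post (toy data again; the weights table t and its field beta are stand-ins for your own table):

import hail as hl

WINDOW_SIZE = 6
WINDOW_DIFF = WINDOW_SIZE // 2

mt = hl.balding_nichols_model(1, 3, 10)

# Stand-in weights table keyed by locus, with one numeric field 'beta'.
t = mt.rows().select().key_by('locus')
t = t.annotate(beta = 0.1 * t.locus.position)

# Step 1: broadcasted Hadamard product -- weight each dosage by its variant's beta.
mt = mt.annotate_entries(weighted = mt.GT.n_alt_alleles() * t[mt.locus].beta)

# Step 2: windowed sum of the weighted entries, exactly as before.
mt = mt.annotate_rows(
    my_windows = hl.range(hl.max(1, mt.locus.position - WINDOW_DIFF),
                          mt.locus.position + WINDOW_DIFF))
mt = mt.explode_rows('my_windows')
mt = mt.annotate_rows(my_window = mt.my_windows + 1)
mt = mt.annotate_rows(
    locus_window = hl.interval(hl.locus(mt.locus.contig, hl.max(1, mt.my_window - WINDOW_DIFF)),
                               hl.locus(mt.locus.contig, mt.my_window + WINDOW_DIFF)))
mt = mt.group_rows_by(mt.locus_window).aggregate(
    weighted_sum_in_window = hl.agg.sum(mt.weighted))
mt.show()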


If you’re really trying to do linear algebra on large matrices, you might have a better time with Dask.

Thanks. When I tried this method on my dataset, I hit an error.

>>> import hail as hl
>>> mt = hl.import_plink("/scratch/zhupy/ukb_imp_chr1_v3_pruned.bed",
...                      "/scratch/zhupy/ukb_imp_chr1_v3_pruned.bim",
...                      "/scratch/zhupy/ukb_imp_chr1_v3_pruned.fam")
Initializing Hail with default parameters...
2020-12-02 21:54:27 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2020-12-02 21:54:31 WARN  Hail:37 - This Hail JAR was compiled for Spark 2.4.5, running with Spark 2.4.1.
  Compatibility is not guaranteed.
Running on Apache Spark version 2.4.1
SparkUI available at http://cdr469.int.cedar.computecanada.ca:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.60-de1845e1c2f6
LOGGING: writing to /scratch/zhupy/hail-20201202-1354-0.2.60-de1845e1c2f6.log
2020-12-02 21:54:55 Hail: INFO: Found 487409 samples in fam file.
2020-12-02 21:54:55 Hail: INFO: Found 1220764 variants in bim file.
>>> glm_result = hl.import_table("/home/zhupy/projects/def-kwalley/zhupy/hli/data/plink2.LDL.glm.linear.tsv",types={"BETA":hl.tfloat64}).select("BETA")
2020-12-02 21:55:04 Hail: INFO: Reading table without type imputation
  Loading field 'CHROM' as type str (not specified)
  Loading field 'POS' as type str (not specified)
  Loading field 'ID' as type str (not specified)
  Loading field 'REF' as type str (not specified)
  Loading field 'ALT' as type str (not specified)
  Loading field 'A1' as type str (not specified)
  Loading field 'TEST' as type str (not specified)
  Loading field 'OBS_CT' as type str (not specified)
  Loading field 'BETA' as type float64 (user-supplied)
  Loading field 'SE' as type str (not specified)
  Loading field 'T_STAT' as type str (not specified)
  Loading field 'P' as type str (not specified)
  Loading field 'ERRCODE' as type str (not specified)
>>> WINDOW_SIZE = 1000
>>> WINDOW_DIFF = WINDOW_SIZE // 2
>>> mt = mt.annotate_entries(new_entry_value = mt.GT.n_alt_alleles() * glm_result[mt.row_key].table_value)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/hail/table.py", line 1583, in index
    return self._index(*exprs, all_matches=all_matches)
  File "/usr/local/lib/python3.6/site-packages/hail/table.py", line 1654, in _index
    return self._index(*exprs[0].values(), all_matches=all_matches)
  File "/usr/local/lib/python3.6/site-packages/hail/table.py", line 1657, in _index
    raise TableIndexKeyError(self.key.dtype, exprs)
hail.table.TableIndexKeyError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/hail/table.py", line 375, in __getitem__
    return self.index(*wrap_to_tuple(item))
  File "/usr/local/lib/python3.6/site-packages/hail/table.py", line 1585, in index
    raise ExpressionException(f"Key type mismatch: cannot index table with given expressions:\n"
hail.expr.expressions.base_expression.ExpressionException: Key type mismatch: cannot index table with given expressions:
  Table key:         <<<empty key>>>
  Index Expressions: locus<GRCh37>, array<str>

I also couldn’t load the dataset that you mentioned.


>>> mt0 = hl.balding_nichols_model(1, 3, 10)
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/py4j/java_gateway.py", line 1159, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python3.6/site-packages/py4j/java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<decorator-gen-1715>", line 2, in balding_nichols_model
  File "/usr/local/lib/python3.6/site-packages/hail/typecheck/check.py", line 614, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/usr/local/lib/python3.6/site-packages/hail/methods/statgen.py", line 2443, in balding_nichols_model
    .format(n_populations, n_samples, n_variants))
  File "/usr/local/lib/python3.6/site-packages/hail/utils/java.py", line 168, in info
    Env.backend().logger.info(msg)
  File "/usr/local/lib/python3.6/site-packages/hail/backend/spark_backend.py", line 116, in info
    self._log_pkg.info(msg)
  File "/usr/local/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python3.6/site-packages/hail/backend/py4j_backend.py", line 16, in deco
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/py4j/protocol.py", line 336, in get_return_value
    format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o0.info

The second error indicates that you haven’t successfully installed Hail. Are you certain you have done everything in the Linux installation instructions? The true error is in the Hail log file; this Python error simply tells us that the JVM crashed.

The first error is quite clear: your table lacks a key. Hail cannot know how to line up the rows of your table with the rows of your genetic data if the table has no variant information. Try modifying your table to include the locus and alleles for each beta, and key it by those fields.
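For example, using the field names shown in your import log, something like this should give the table a usable key (a sketch; adjust field names and allele ordering to match your file):

glm = hl.import_table(
    "/home/zhupy/projects/def-kwalley/zhupy/hli/data/plink2.LDL.glm.linear.tsv",
    types={"BETA": hl.tfloat64, "POS": hl.tint32})
# Build locus and alleles fields from the CHROM/POS/REF/ALT columns,
# then key by them so the table lines up with the MatrixTable's row key.
glm = glm.annotate(
    locus = hl.locus(glm.CHROM, glm.POS, reference_genome='GRCh37'),
    alleles = [glm.REF, glm.ALT])
glm = glm.key_by('locus', 'alleles').select('BETA')

mt = mt.annotate_entries(
    new_entry_value = mt.GT.n_alt_alleles() * glm[mt.row_key].BETA)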

It looks like you’re trying to compute some sort of windowed polygenic score. You might find our polygenic score documentation helpful.