Regression with multiple phenotypes with varying degrees of missingness

A basic coding question.

I’m trying to run a phewas with hundreds of phenotypes. My plan is to use linear_regression_rows. Some of the phenotypes have many missing values; others do not. Thus, I want my outcome variable to be a list of lists, not a single list, so that each phenotype is treated independently. If I had three phenotypes, I’d probably write the code as follows:

phewas = hl.linear_regression_rows(y=[[mt.pheno.A],[mt.pheno.B],[mt.pheno.C]], x=hl.gp_dosage(mt.GP),covariates=[1.0,mt.pheno.age,])

With hundreds of phenotypes, however, it is difficult to manually list them all out. How can I efficiently create a list of lists that I can use as the y parameter in linear_regression_rows?


First, let me clarify that if two phenotypes appear in the same inner list, e.g. A and B in [[A, B], [C]], they are treated as independent regressions (the same as if we had written [[A],[B],[C]]), but are computed in a more computationally efficient way. The only requirement for appearing in the same inner list is a shared missingness pattern. Essentially, this expression should evaluate to true:

hl.is_defined(mt.pheno.A).collect() == hl.is_defined(mt.pheno.B).collect()

If every phenotype has a different missingness pattern, you can easily produce a list of independent lists with this python expression:

[[mt.pheno[x]] for x in mt.pheno]

This works because any “struct” (aka nested) annotation in Hail can be used as if it is a list of its children:

In [11]: mt = hl.balding_nichols_model(3,10,10) 
2020-03-20 13:59:59 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 10 samples, and 10 variants...
        struct {
        n_populations: int32, 
        n_samples: int32, 
        n_variants: int32, 
        n_partitions: int32, 
        pop_dist: array<int32>, 
        fst: array<float64>, 
        mixture: bool
    <hail.matrixtable.MatrixTable object at 0x120b2c4d0>

In [12]: list(                                                                                                                                                                                     

Thanks, Dan. Each phenotype has a unique pattern of missingness, so the python expression you mention is just what I was trying to come up with.

I dont know i can buzz in or not but when u run linear regression one at a time. it will consider missingness for each phenotype seperately. say if a phenotype have n values out of m samples, while running regression it will filter samples which doesnt have values in phenotypes and do regression based on n values. Thats what i observer for my analysis. How many penotypes are you running on? @Mark . ps: sorry if i buzz with wrong answer. U can use R to get all missing values for n number of phenotypes columns.

Here’s what I tried:

phens = [[mt.pheno[x]] for x in mt.pheno]
phewas = hl.linear_regression_rows(y=phens, x=hl.gp_dosage(mt.GP),covariates=[1.0,mt.covariates.age,])

Unfortunately, I got the following error:

ERROR:root:Exception while sending command.
Traceback (most recent call last):
File “/usr/lib/spark/python/lib/”, line 1159, in send_command
raise Py4JNetworkError(“Answer from Java side is empty”)
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/usr/lib/spark/python/lib/”, line 985, in send_command
response = connection.send_command(command)
File “/usr/lib/spark/python/lib/”, line 1164, in send_command
“Error while receiving”, e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving

It also said:

Py4JError: An error occurred while calling o102.pyPersist.

Any idea why this might be happening?

I don’t know if it’s relevant, but all my column names in the phenotype matrix are just numbers. Could that be causing a problem?