PheWAS on DNAnexus UKB RAP

Hi all,

I’m trying to run a PheWAS (ideally using Hail) on the DNAnexus UKB RAP. Is anyone aware of any tutorials/code to show how to do this? I’ve not been able to find anything in Hail or DNAnexus documentation (and have posted a similar question on the DNAnexus forums).

I guess that most of the work will be to process the phenotypes into a matrixtable, and that actually running the regression should be fairly straightforward with linear_regression_rows.

Best,
Jeremy


Hey @jbchang !

We don’t have any complete working examples, but if you have concrete issues or questions we can certainly help with those.

Some general comments to get you started:

  1. If all your phenotypes are in CSVs/TSVs, then getting the phenotypes onto your genetic data should be rather straightforward:
phenos1 = hl.import_table(...).key_by("sample_id")
phenos2 = hl.import_table(...).key_by("sample_id")
...
phenosN = hl.import_table(...).key_by("sample_id")

mt = mt.annotate_cols(
    **phenos1[mt.col_key],
    ...
    **phenosN[mt.col_key]
)
  2. You should group your phenotypes such that, within each group, all phenotypes have the same missingness pattern. In particular, Hail will exclude a sample from every phenotype if that sample has even one undefined response variable or covariate. You can assess the missingness patterns by counting how often each pair of phenotypes is jointly defined:
c = mt.cols()
c = c.select(pheno1 = hl.is_defined(c.pheno1), ...)
c = c.select(pheno_vector = [c.pheno1, c.pheno2, ...])
import numpy as np
# rows are samples, columns are phenotypes; 1 if defined, 0 if missing
pheno_defined_matrix = np.array(c.pheno_vector.collect(), dtype=int)
# entry (i, j) counts samples that are defined for both phenotype i and phenotype j
pheno_defined_counts = pheno_defined_matrix.T @ pheno_defined_matrix
  3. Once you have groups, you can run a regression:
group1 = [mt.pheno1, mt.pheno3, ...]
group2 = ...
regression_results = hl.linear_regression_rows(
    y=[group1, group2, ...],
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0, ...]
)

Thank you @danking! That is very useful—I think I should have the phenotypes ready pretty soon and will try out your suggestions.

Best,
Jeremy


Hi @danking, hope you are doing well! I am wondering about the “grouped phenotype” feature of logistic_regression_rows(): was this perhaps added in this GitHub pull request, which was released in Hail version 0.2.8?

If so, Hail on DNAnexus is at 0.2.7.8. So it seems like I may have to either (1) figure out how to install a more recent version of Hail or (2) figure out some other workaround, probably by creating one matrixtable of results per phenotype…?

Best,
Jeremy

Hey @jbchang !

logistic_regression_rows indeed supports multiple phenotypes but does not take advantage of phenotype groups the way linear_regression_rows does. In particular, in simple linear regression, we can “project away” the covariates from the genotype matrix. This reduces the amount of necessary per-row computation. I’m not sure if there is a mathematical reason we don’t implement this for logistic_regression_rows, but I suspect there is.
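
Roughly, the linear-algebra trick looks like this (a numpy sketch for intuition only, not Hail’s actual implementation; all names here are illustrative):

import numpy as np

def project_away(C, M):
    # Remove from M the component lying in the column space of the covariates C.
    Q, _ = np.linalg.qr(C)          # orthonormal basis for the covariate columns
    return M - Q @ (Q.T @ M)

# With covariates C (n_samples x k), phenotypes Y (n_samples x p), and a genotype
# vector g (n_samples,), regressing the residualized Y on the residualized g gives
# the same per-variant genotype effects as including C in every model
# (the Frisch-Waugh-Lovell theorem), so the covariate work is done once rather than per row:
# g_res = project_away(C, g[:, None])
# Y_res = project_away(C, Y)
# beta = (g_res.T @ Y_res) / (g_res.T @ g_res)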

Hail follows semver, so we only use MAJOR.MINOR.PATCH versions; we never have a fourth version number. Are you sure DNAnexus does not have 0.2.78? One year old is quite old by Hail standards, but that version should support multi-phenotype logistic regression.

I strongly recommend asking DNAnexus to update their Hail version if you need a newer one. I don’t know how they’ve set up their Spark clusters, so I’m hesitant to give instructions for upgrading. That said, in general, you can install Hail on an arbitrary Spark cluster by following the installation instructions.

Thank you! Oops, sorry, yes 0.2.78. I’ll get in touch with them to update their Hail version.

Hmm… if this version does support multi-phenotype logistic regression, then I am not sure what the issue is. Let me dig around and see what it might be…

What error do you receive?

I re-ran the cell in the Jupyter notebook, so I’m not sure. I’m kicking myself for not recording it. FWIW, when I was working yesterday, a couple of times a cell would fail and I think the kernel restarted (the execution numbers started again from 1, and the variables were cleared).

Now, at least when I run logistic regression on a single phenotype, it appears to work fine (I’m trying a larger cluster, since sometimes that seems to help). I’ll check the output, try multiple phenotypes, and report back.

Best,
Jeremy


OK, coming back with two questions (well, really just one now): one on grouping the phenotypes properly, and another on slow execution speed (UPDATE: I figured out why execution was so slow… I wasn’t writing intermediate matrixtables with checkpoint()).

Missingness pattern issue

I think I am having trouble formatting the phenotypes properly to get around the missingness behavior you referred to in the second point of your post above. Whether I enter the phenotypes as a list or as a list of lists, I get errors:

Entering phenotypes as a single list (see code/output below), I think Hail is treating this as one group and is filtering out all samples that are missing for any of the phenotypes, as you said. The error that I get (y must be non-constant) is probably happening because all of the 1 entries are getting filtered out. Each phenotype (phecode_xx_x.has_phenotype) is a column field that is 1 for samples that have the phecode, 0 if they don’t, and missing (I think) if they weren’t annotated. When we constructed the cases/controls, we excluded some samples from the control population if they had a confounding phenotype, so the missingness patterns will be different for each of the phecodes.

regression_results1 = hl.logistic_regression_rows(
    test = 'wald',
    y=[burden_mt3.phecode_10_0.has_phenotype, burden_mt3.phecode_110_1.has_phenotype, burden_mt3.phecode_110_11.has_phenotype],
    x=burden_mt3.n_variants,
    covariates=[1.0,
                burden_mt3.sample_data.p22009_a1,
                burden_mt3.sample_data.p22009_a2,
                burden_mt3.sample_data.p22009_a3,
                burden_mt3.sample_data.p22009_a4,
                burden_mt3.sample_data.p22009_a5,
                burden_mt3.sample_data.p22009_a6,
                burden_mt3.sample_data.p22009_a7,
                burden_mt3.sample_data.p22009_a8,
                burden_mt3.sample_data.p22009_a9,
                burden_mt3.sample_data.p22009_a10]
)
2022-12-19 21:23:16 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:23:45 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:24:13 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:24:40 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:25:49 Hail: INFO: Coerced sorted dataset
2022-12-19 21:25:49 Hail: INFO: Coerced dataset with out-of-order partitions.
2022-12-19 21:27:05 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:27:32 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:28:00 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:28:27 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:29:33 Hail: INFO: Coerced sorted dataset
2022-12-19 21:29:33 Hail: INFO: Coerced dataset with out-of-order partitions.
2022-12-19 21:29:44 Hail: WARN: 99919 of 468412 samples have a missing phenotype or covariate.

---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
<ipython-input-49-dc544c769d5c> in <module>
     13                 burden_mt3.sample_data.p22009_a8,
     14                 burden_mt3.sample_data.p22009_a9,
---> 15                 burden_mt3.sample_data.p22009_a10]
     16 )

<decorator-gen-1733> in logistic_regression_rows(test, y, x, covariates, pass_through)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/opt/conda/lib/python3.6/site-packages/hail/methods/statgen.py in logistic_regression_rows(test, y, x, covariates, pass_through)
    865         result = result.transmute(**result.logistic_regression[0])
    866 
--> 867     return result.persist()
    868 
    869 

<decorator-gen-1143> in persist(self, storage_level)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/opt/conda/lib/python3.6/site-packages/hail/table.py in persist(self, storage_level)
   1868             Persisted table.
   1869         """
-> 1870         return Env.backend().persist_table(self, storage_level)
   1871 
   1872     def unpersist(self) -> 'Table':

/opt/conda/lib/python3.6/site-packages/hail/backend/spark_backend.py in persist_table(self, t, storage_level)
    288 
    289     def persist_table(self, t, storage_level):
--> 290         return Table._from_java(self._jbackend.pyPersistTable(storage_level, self._to_java_table_ir(t._tir)))
    291 
    292     def unpersist_table(self, t):

/cluster/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     29                 raise FatalError('%s\n\nJava stack trace:\n%s\n'
     30                                  'Hail version: %s\n'
---> 31                                  'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
     32         except pyspark.sql.utils.CapturedException as e:
     33             raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: HailException: For logistic regression, y at index 1 must be non-constant

Java stack trace:
is.hail.utils.HailException: For logistic regression, y at index 1 must be non-constant
	at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:11)
	at is.hail.utils.package$.fatal(package.scala:78)
	at is.hail.methods.LogisticRegression$$anonfun$execute$1.apply$mcVI$sp(LogisticRegression.scala:45)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
	at is.hail.methods.LogisticRegression.execute(LogisticRegression.scala:40)
	at is.hail.expr.ir.functions.WrappedMatrixToTableFunction.execute(RelationalFunctions.scala:51)
	at is.hail.expr.ir.TableToTableApply.execute(TableIR.scala:2934)
	at is.hail.expr.ir.TableIR.analyzeAndExecute(TableIR.scala:57)
	at is.hail.expr.ir.Interpret$.apply(Interpret.scala:27)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1$$anonfun$apply$9.apply(SparkBackend.scala:502)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1$$anonfun$apply$9.apply(SparkBackend.scala:501)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:47)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:47)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:47)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:46)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:46)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1.apply(SparkBackend.scala:501)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1.apply(SparkBackend.scala:493)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
	at is.hail.backend.spark.SparkBackend.pyPersistTable(SparkBackend.scala:493)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)



Hail version: 0.2.78-b17627756568
Error summary: HailException: For logistic regression, y at index 1 must be non-constant

Entering phenotypes as a list of lists (see code/output below), as you suggested in your post above, seems to just give me a type-check error.

regression_results1 = hl.logistic_regression_rows(
    test = 'wald',
    y=[[burden_mt3.phecode_10_0.has_phenotype], [burden_mt3.phecode_110_1.has_phenotype], [burden_mt3.phecode_110_11.has_phenotype]],
    x=burden_mt3.n_variants,
    covariates=[1.0,
                burden_mt3.sample_data.p22009_a1,
                burden_mt3.sample_data.p22009_a2,
                burden_mt3.sample_data.p22009_a3,
                burden_mt3.sample_data.p22009_a4,
                burden_mt3.sample_data.p22009_a5,
                burden_mt3.sample_data.p22009_a6,
                burden_mt3.sample_data.p22009_a7,
                burden_mt3.sample_data.p22009_a8,
                burden_mt3.sample_data.p22009_a9,
                burden_mt3.sample_data.p22009_a10]
)
---------------------------------------------------------------------------
TypecheckFailure                          Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in arg_check(arg, function_name, arg_name, checker)
    583     try:
--> 584         return checker.check(arg, function_name, arg_name)
    585     except TypecheckFailure as e:

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in check(self, x, caller, param)
     84                 pass
---> 85         raise TypecheckFailure()
     86 

TypecheckFailure: 

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-50-8221d851994e> in <module>
     13                 burden_mt3.sample_data.p22009_a8,
     14                 burden_mt3.sample_data.p22009_a9,
---> 15                 burden_mt3.sample_data.p22009_a10]
     16 )

<decorator-gen-1733> in logistic_regression_rows(test, y, x, covariates, pass_through)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    574     @decorator
    575     def wrapper(__original_func, *args, **kwargs):
--> 576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
    577         return __original_func(*args_, **kwargs_)
    578 

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in check_all(f, args, kwargs, checks, is_method)
    541                 raise TypeError(
    542                     f'Expected {n_pos_args} positional arguments, found {len(args)}')
--> 543             args_.append(arg_check(args[i], name, arg_name, checker))
    544         elif param.kind in (param.KEYWORD_ONLY, param.POSITIONAL_OR_KEYWORD):
    545             arg = kwargs.pop(arg_name, param.default)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in arg_check(arg, function_name, arg_name, checker)
    590                             expected=checker.expects(),
    591                             found=checker.format(arg)
--> 592                         )) from e
    593 
    594 

TypeError: logistic_regression_rows: parameter 'y': expected (expression of type float64 or Sequence[expression of type float64]), found list: [[<Float64Expression of type float64>], [<Float64Expression of type float64>], [<Float64Expression of type float64>]]

Similarly, entering phenotypes as a list of lists of length > 1 (see code/output below) also seems to just give me a type-check error. I wanted to try this in case each sub-list had to have length > 1.

regression_results1 = hl.logistic_regression_rows(
    test = 'wald',
    y=[[burden_mt3.phecode_10_0.has_phenotype, burden_mt3.phecode_110_1.has_phenotype], [burden_mt3.phecode_110_11.has_phenotype, burden_mt3.phecode_275_5.has_phenotype]],
    x=burden_mt3.n_variants,
    covariates=[1.0,
                burden_mt3.sample_data.p22009_a1,
                burden_mt3.sample_data.p22009_a2,
                burden_mt3.sample_data.p22009_a3,
                burden_mt3.sample_data.p22009_a4,
                burden_mt3.sample_data.p22009_a5,
                burden_mt3.sample_data.p22009_a6,
                burden_mt3.sample_data.p22009_a7,
                burden_mt3.sample_data.p22009_a8,
                burden_mt3.sample_data.p22009_a9,
                burden_mt3.sample_data.p22009_a10]
)
---------------------------------------------------------------------------
TypecheckFailure                          Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in arg_check(arg, function_name, arg_name, checker)
    583     try:
--> 584         return checker.check(arg, function_name, arg_name)
    585     except TypecheckFailure as e:

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in check(self, x, caller, param)
     84                 pass
---> 85         raise TypecheckFailure()
     86 

TypecheckFailure: 

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<timed exec> in <module>

<decorator-gen-1733> in logistic_regression_rows(test, y, x, covariates, pass_through)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    574     @decorator
    575     def wrapper(__original_func, *args, **kwargs):
--> 576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
    577         return __original_func(*args_, **kwargs_)
    578 

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in check_all(f, args, kwargs, checks, is_method)
    541                 raise TypeError(
    542                     f'Expected {n_pos_args} positional arguments, found {len(args)}')
--> 543             args_.append(arg_check(args[i], name, arg_name, checker))
    544         elif param.kind in (param.KEYWORD_ONLY, param.POSITIONAL_OR_KEYWORD):
    545             arg = kwargs.pop(arg_name, param.default)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in arg_check(arg, function_name, arg_name, checker)
    590                             expected=checker.expects(),
    591                             found=checker.format(arg)
--> 592                         )) from e
    593 
    594 

TypeError: logistic_regression_rows: parameter 'y': expected (expression of type float64 or Sequence[expression of type float64]), found list: [[<Float64Expression of type float64>, <Float64Expression of type float64>], [<Float64Expression of type float64>, <Float64Expression of type float64>]]

Execution appears to take a long time

SEE BELOW FOR UPDATE

I was able to run a logistic regression on a single phenotype (code below for reference), but it took a long time—7 minutes. I’m wondering what is making it take so long, and whether there’s a way to parallelize this that I’m not taking advantage of. My guess is that Hail already does this under the hood (such that when we figure out how to run regression on multiple phenotypes, it will go faster than just running them serially), but I’m not sure.

One guess I have for why this is taking so long is that the burden_mt3 I’m using is just a few commands away from the mt_snps described in this post. Since mt_snps seems to be so complicated an expression that I can’t even write it to disk, I wonder whether a lot of processing to materialize burden_mt3 is happening every time I run the regression.

So, one thing I’m going to try next is to write intermediate matrixtables to disk and then (re-)read them. My hope is that this way, Hail won’t have to keep track of and run all the statements each time I do something to materialize the matrixtable.

%%time 

regression_results3 = hl.logistic_regression_rows(
    test = 'wald',
    y=burden_mt3.phecode_275_5.has_phenotype,
    x=burden_mt3.n_variants,
    covariates=[1.0,
                burden_mt3.sample_data.p22009_a1,
                burden_mt3.sample_data.p22009_a2,
                burden_mt3.sample_data.p22009_a3,
                burden_mt3.sample_data.p22009_a4,
                burden_mt3.sample_data.p22009_a5,
                burden_mt3.sample_data.p22009_a6,
                burden_mt3.sample_data.p22009_a7,
                burden_mt3.sample_data.p22009_a8,
                burden_mt3.sample_data.p22009_a9,
                burden_mt3.sample_data.p22009_a10]
)
CPU times: user 538 ms, sys: 17 ms, total: 555 ms
Wall time: 7min 2s

UPDATE: I started reading/writing intermediate matrixtables, and got a 70x speed-up:

CPU times: user 54 ms, sys: 7.24 ms, total: 61.2 ms
Wall time: 5.67 s

If this isn’t emphasized in the Hail documentation (as I don’t believe it is), I think it could really help! In retrospect, I was probably spending a long time just asking the cluster to run the same code over and over again to materialize basically the same matrixtable with slight modification.
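
For reference, the pattern I’m now using looks roughly like this (a sketch; the path is just a placeholder for wherever intermediate files live on the RAP):

import hail as hl

# checkpoint() writes the MatrixTable to storage and immediately reads it back,
# so downstream calls (like logistic_regression_rows) start from materialized
# data instead of re-running the whole upstream pipeline each time.
burden_mt3 = burden_mt3.checkpoint('dnax://my-database/burden_mt3.mt',  # placeholder path
                                   overwrite=True)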

Best,
Jeremy

So, my earlier comment about groups only applies to linear_regression_rows (nb: linear). For logistic regression rows, the array syntax should treat each phenotype as an independent regression.

I believe the error message is telling you that

burden_mt3.phecode_110_1.has_phenotype

is all True or all False. You can look at that phenotype directly:

burden_mt3.phecode_110_1.has_phenotype.collect()

Is that all True or all False? I would expect you’ll also get an error if you run this:

regression_results1 = hl.logistic_regression_rows(
    test = 'wald',
    y=[burden_mt3.phecode_110_1.has_phenotype],
    x=burden_mt3.n_variants,
)
regression_results1.collect()

Running the regressions repeatedly will be slightly slower because you have to load the data from disk into RAM for each regression separately. Hail can amortize the cost of loading across many regressions if you use an array of phenotypes.


It does seem that Hail can’t write your full mt_snps because it’s too large. I’d try doing half of the phenotypes at a time.


Can you share the full script with the checkpoints? This suggests to me that there’s either a lot of “shuffling” (data reordering) or a lot of conversion from other formats (e.g. from VCF files).

Thank you @danking! Based on the output I’m seeing, I wonder whether logistic_regression_rows() in the Hail version I’m using (0.2.78) doesn’t have the functionality of treating each phenotype as an independent regression. The regression fails on the array of phenotypes with the “y must be non-constant” error, but succeeds on each individual phenotype, except for one that fails for a different reason.

Any thoughts you have on a workaround would be appreciated! Worst case, I just run the regressions individually in series. I’m also open to suggestions on how to fix the HailException: Failed to fit logistic regression null model error I get for the single regression.


Here are my findings…

The has_phenotype column field doesn’t appear to be constant across the columns (samples), given the input/output below:
Input:

for p in ['phecode_10_0', 'phecode_110_1', 'phecode_110_11',
          'phecode_132_0', 'phecode_132_1', 'phecode_275_5']:
    print(p + ' :' + str(mt_temp.aggregate_cols(hl.agg.counter(mt_temp[p].has_phenotype))))

Output:

phecode_10_0 :frozendict({0.0: 372364, 1.0: 403, None: 95645})
phecode_110_1 :frozendict({0.0: 398689, 1.0: 157, None: 69566})
phecode_110_11 :frozendict({0.0: 398689, 1.0: 192, None: 69531})
phecode_132_0 :frozendict({0.0: 404374, 1.0: 71, None: 63967})
phecode_132_1 :frozendict({0.0: 404374, 1.0: 14, None: 64024})
phecode_275_5 :frozendict({0.0: 398121, 1.0: 3883, None: 66408})

I checked one of these numbers against the original CSV file, and they aren’t the same, but I think this is ok because the samples in my matrixtable went through QC.


Regression on an array of phenotypes

When I try to run logistic regression on an array of phenotypes, I get the ‘y must be non-constant’ error. It also tells me that 103361 of 468412 samples have a missing phenotype or covariate, which is more than the missingness of any individual phenotype above.

Input code:

regression_results1 = hl.logistic_regression_rows(
    test = 'wald',
    y=[mt_temp.phecode_10_0.has_phenotype, mt_temp.phecode_110_1.has_phenotype, mt_temp.phecode_110_11.has_phenotype, mt_temp.phecode_132_0.has_phenotype, mt_temp.phecode_132_1.has_phenotype, mt_temp.phecode_275_5.has_phenotype],
    x=mt_temp.n_variants,
    covariates=[1.0,
                mt_temp.age_float, #note that this is mis-labeled, it is actually sex_float
                mt_temp.p21003_i0_squared,
                mt_temp.p21003_i0_float,
                mt_temp.sample_data.p22009_a1,
                mt_temp.sample_data.p22009_a2,
                mt_temp.sample_data.p22009_a3,
                mt_temp.sample_data.p22009_a4,
                mt_temp.sample_data.p22009_a5,
                mt_temp.sample_data.p22009_a6,
                mt_temp.sample_data.p22009_a7,
                mt_temp.sample_data.p22009_a8,
                mt_temp.sample_data.p22009_a9,
                mt_temp.sample_data.p22009_a10]
)

Output error:

2022-12-21 04:45:49 Hail: INFO: Coerced sorted dataset
2022-12-21 04:45:49 Hail: INFO: Coerced dataset with out-of-order partitions.
2022-12-21 04:46:19 Hail: INFO: Coerced sorted dataset
2022-12-21 04:46:19 Hail: INFO: Coerced dataset with out-of-order partitions.
2022-12-21 04:46:22 Hail: WARN: 103361 of 468412 samples have a missing phenotype or covariate.

---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
<ipython-input-94-0327d46b4e47> in <module>
     16                 mt_temp.sample_data.p22009_a8,
     17                 mt_temp.sample_data.p22009_a9,
---> 18                 mt_temp.sample_data.p22009_a10]
     19 )

<decorator-gen-1733> in logistic_regression_rows(test, y, x, covariates, pass_through)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/opt/conda/lib/python3.6/site-packages/hail/methods/statgen.py in logistic_regression_rows(test, y, x, covariates, pass_through)
    865         result = result.transmute(**result.logistic_regression[0])
    866 
--> 867     return result.persist()
    868 
    869 

<decorator-gen-1143> in persist(self, storage_level)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/opt/conda/lib/python3.6/site-packages/hail/table.py in persist(self, storage_level)
   1868             Persisted table.
   1869         """
-> 1870         return Env.backend().persist_table(self, storage_level)
   1871 
   1872     def unpersist(self) -> 'Table':

/opt/conda/lib/python3.6/site-packages/hail/backend/spark_backend.py in persist_table(self, t, storage_level)
    288 
    289     def persist_table(self, t, storage_level):
--> 290         return Table._from_java(self._jbackend.pyPersistTable(storage_level, self._to_java_table_ir(t._tir)))
    291 
    292     def unpersist_table(self, t):

/cluster/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     29                 raise FatalError('%s\n\nJava stack trace:\n%s\n'
     30                                  'Hail version: %s\n'
---> 31                                  'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
     32         except pyspark.sql.utils.CapturedException as e:
     33             raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: HailException: For logistic regression, y at index 1 must be non-constant

Java stack trace:
is.hail.utils.HailException: For logistic regression, y at index 1 must be non-constant
	at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:11)
	at is.hail.utils.package$.fatal(package.scala:78)
	at is.hail.methods.LogisticRegression$$anonfun$execute$1.apply$mcVI$sp(LogisticRegression.scala:45)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
	at is.hail.methods.LogisticRegression.execute(LogisticRegression.scala:40)
	at is.hail.expr.ir.functions.WrappedMatrixToTableFunction.execute(RelationalFunctions.scala:51)
	at is.hail.expr.ir.TableToTableApply.execute(TableIR.scala:2934)
	at is.hail.expr.ir.TableIR.analyzeAndExecute(TableIR.scala:57)
	at is.hail.expr.ir.Interpret$.apply(Interpret.scala:27)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1$$anonfun$apply$9.apply(SparkBackend.scala:502)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1$$anonfun$apply$9.apply(SparkBackend.scala:501)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:47)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:47)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:47)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:46)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:46)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1.apply(SparkBackend.scala:501)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1.apply(SparkBackend.scala:493)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
	at is.hail.backend.spark.SparkBackend.pyPersistTable(SparkBackend.scala:493)
	at sun.reflect.GeneratedMethodAccessor64.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)



Hail version: 0.2.78-b17627756568
Error summary: HailException: For logistic regression, y at index 1 must be non-constant

Regression on each phenotype individually

Input code looks like below, where the column fields vary according to this list:
['phecode_10_0', 'phecode_110_1', 'phecode_110_11', 'phecode_132_0', 'phecode_132_1', 'phecode_275_5']

regression_results1 = hl.logistic_regression_rows(
    test = 'wald',
    y=[mt_temp.phecode_10_0.has_phenotype],
    x=mt_temp.n_variants,
    covariates=[1.0,
                mt_temp.age_float, #note that this is mis-labeled, it is actually sex_float
                mt_temp.p21003_i0_squared,
                mt_temp.p21003_i0_float,
                mt_temp.sample_data.p22009_a1,
                mt_temp.sample_data.p22009_a2,
                mt_temp.sample_data.p22009_a3,
                mt_temp.sample_data.p22009_a4,
                mt_temp.sample_data.p22009_a5,
                mt_temp.sample_data.p22009_a6,
                mt_temp.sample_data.p22009_a7,
                mt_temp.sample_data.p22009_a8,
                mt_temp.sample_data.p22009_a9,
                mt_temp.sample_data.p22009_a10]
)

The only one with an error was phecode_10_0; the rest ran fine. The error was HailException: Failed to fit logistic regression null model; I’m not sure how to fix this, but I haven’t dug into it yet either.

2022-12-21 04:31:03 Hail: INFO: Coerced sorted dataset
2022-12-21 04:31:03 Hail: INFO: Coerced dataset with out-of-order partitions.
2022-12-21 04:31:25 Hail: INFO: Coerced sorted dataset
2022-12-21 04:31:25 Hail: INFO: Coerced dataset with out-of-order partitions.
2022-12-21 04:31:28 Hail: WARN: 95645 of 468412 samples have a missing phenotype or covariate.
2022-12-21 04:31:28 Hail: INFO: logistic_regression_rows: running wald on 372767 samples for response variable y,
    with input variable x, and 14 additional covariates...

---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
<ipython-input-84-cfcdd48d8841> in <module>
     16                 mt_temp.sample_data.p22009_a8,
     17                 mt_temp.sample_data.p22009_a9,
---> 18                 mt_temp.sample_data.p22009_a10]
     19 )

<decorator-gen-1733> in logistic_regression_rows(test, y, x, covariates, pass_through)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/opt/conda/lib/python3.6/site-packages/hail/methods/statgen.py in logistic_regression_rows(test, y, x, covariates, pass_through)
    865         result = result.transmute(**result.logistic_regression[0])
    866 
--> 867     return result.persist()
    868 
    869 

<decorator-gen-1143> in persist(self, storage_level)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/opt/conda/lib/python3.6/site-packages/hail/table.py in persist(self, storage_level)
   1868             Persisted table.
   1869         """
-> 1870         return Env.backend().persist_table(self, storage_level)
   1871 
   1872     def unpersist(self) -> 'Table':

/opt/conda/lib/python3.6/site-packages/hail/backend/spark_backend.py in persist_table(self, t, storage_level)
    288 
    289     def persist_table(self, t, storage_level):
--> 290         return Table._from_java(self._jbackend.pyPersistTable(storage_level, self._to_java_table_ir(t._tir)))
    291 
    292     def unpersist_table(self, t):

/cluster/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     29                 raise FatalError('%s\n\nJava stack trace:\n%s\n'
     30                                  'Hail version: %s\n'
---> 31                                  'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
     32         except pyspark.sql.utils.CapturedException as e:
     33             raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: HailException: Failed to fit logistic regression null model (standard MLE with covariates only): exploded at Newton iteration 8

Java stack trace:
is.hail.utils.HailException: Failed to fit logistic regression null model (standard MLE with covariates only): exploded at Newton iteration 8
	at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:11)
	at is.hail.utils.package$.fatal(package.scala:78)
	at is.hail.methods.LogisticRegression$$anonfun$2.apply(LogisticRegression.scala:67)
	at is.hail.methods.LogisticRegression$$anonfun$2.apply(LogisticRegression.scala:58)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.Range.foreach(Range.scala:160)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at is.hail.methods.LogisticRegression.execute(LogisticRegression.scala:58)
	at is.hail.expr.ir.functions.WrappedMatrixToTableFunction.execute(RelationalFunctions.scala:51)
	at is.hail.expr.ir.TableToTableApply.execute(TableIR.scala:2934)
	at is.hail.expr.ir.TableIR.analyzeAndExecute(TableIR.scala:57)
	at is.hail.expr.ir.Interpret$.apply(Interpret.scala:27)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1$$anonfun$apply$9.apply(SparkBackend.scala:502)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1$$anonfun$apply$9.apply(SparkBackend.scala:501)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:47)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:47)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:47)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:46)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:46)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1.apply(SparkBackend.scala:501)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1.apply(SparkBackend.scala:493)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
	at is.hail.backend.spark.SparkBackend.pyPersistTable(SparkBackend.scala:493)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)



Hail version: 0.2.78-b17627756568
Error summary: HailException: Failed to fit logistic regression null model (standard MLE with covariates only): exploded at Newton iteration 8

Best,
Jeremy

Ah, I’m sorry. You’re correct, logistic regression rows appears to treat the entire y-vector as one group. That’s kind of annoying. I’ll look into fixing that properly. Apologies for my error before. I was misunderstanding how logistic_regression_rows worked.

With respect to the null model issue, we use an iterative fit of the model where beta_x = 0. It seems that the iterative fit “exploded”. I think this can happen when the phenotype is perfectly separable / predictable from the covariates alone.
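
One quick way to check for that, as a sketch (reusing the field names from your post): cross-tabulate the failing phenotype against a binary covariate and look for an empty cell, which would indicate perfect separation.

import hail as hl

# If any covariate category contains only 0.0s (or only 1.0s) for has_phenotype,
# that covariate perfectly separates the cases from the controls.
print(mt_temp.aggregate_cols(
    hl.agg.group_by(mt_temp.age_float,  # actually sex, per your comment
                    hl.agg.counter(mt_temp.phecode_10_0.has_phenotype))))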

Thank you @danking ! No worries. Would you have any suggestions on how to parallelize logistic_regression_rows() based on its current functionality? With 1500 phenotypes, ~10 seconds/phenotype, we’re looking at 250 minutes, which is not awful.

I’m also not confident I know how to do the error handling properly (so that one bad regression doesn’t halt the entire operation)… I’ll start looking into this too.
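
For what it’s worth, this is roughly the loop I have in mind (a sketch only; the output paths are placeholders and most covariates are omitted for brevity):

import hail as hl

phenos = ['phecode_10_0', 'phecode_110_1', 'phecode_110_11',
          'phecode_132_0', 'phecode_132_1', 'phecode_275_5']

results = {}
for p in phenos:
    try:
        ht = hl.logistic_regression_rows(
            test='wald',
            y=mt_temp[p].has_phenotype,
            x=mt_temp.n_variants,
            covariates=[1.0,
                        mt_temp.age_float,  # actually sex
                        mt_temp.p21003_i0_float,
                        mt_temp.p21003_i0_squared])
        # write each result out so one bad regression doesn't lose earlier work
        results[p] = ht.checkpoint(f'dnax://my-database/logreg_{p}.ht', overwrite=True)
    except Exception as e:  # Hail surfaces these failures as FatalError; catching broadly here
        print(f'{p} failed: {e}')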

Best,
Jeremy

Hmm. Yeah that really is not ideal. The way we have logistic regression implemented is antagonistic to running many log regs in parallel with different phenotype missingness patterns.

In the meantime, my best recommendations are:

  1. For phenotypes that do have equivalent missingness patterns, use the array form of logreg (see the sketch at the end of this post for one way to find such groups).
  2. Perhaps use linear regression rows instead (which supports multiple phenotypes with distinct missingness patterns at once)? If a linear model shows a strong relationship, you can follow up with the much more resource-intensive logistic regression to get the odds ratio.

In general, linear regression will be way faster, so it seems like a valuable first-pass filter regardless. AFAIK, you’ll get a strong linear relationship iff you have a strong logistic relationship.
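
For example, here is one rough way to find groups with identical missingness (a sketch using the phecode fields from your posts; with ~1500 phenotypes you would probably hash the patterns instead of keeping them as tuples):

import hail as hl
from collections import defaultdict

phenos = ['phecode_10_0', 'phecode_110_1', 'phecode_110_11',
          'phecode_132_0', 'phecode_132_1', 'phecode_275_5']

# One pass over the columns: record, for every sample, which phenotypes are defined.
defined = mt_temp.aggregate_cols(hl.agg.collect(
    hl.struct(**{p: hl.is_defined(mt_temp[p].has_phenotype) for p in phenos})))

# Phenotypes with identical per-sample definedness patterns can share one
# array-form regression without dropping any extra samples.
groups = defaultdict(list)
for p in phenos:
    groups[tuple(row[p] for row in defined)].append(p)

for group in groups.values():
    result = hl.logistic_regression_rows(
        test='wald',
        y=[mt_temp[p].has_phenotype for p in group],
        x=mt_temp.n_variants,
        covariates=[1.0])  # plus sex, age, and the PCs as before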