PheWAS on DNAnexus UKB RAP

OK, coming back with two questions! One on grouping the phenotypes properly, and another on slow execution speed (UPDATE: I figured out why execution was so slow… I wasn't writing intermediate MatrixTables with checkpoint()).

Missingness pattern issue

I think I'm having trouble formatting the phenotypes properly to get around the missingness behavior you referred to in the second point of your post above. When I try to pass the phenotypes either as a list or as a list of lists, I get errors:

Entering the phenotypes as a single list (see code/output below), I think Hail treats them as one group and filters out all samples that are missing for any of the phenotypes, as you said. The error I get (y must be non-constant) is probably happening because all of the 1 entries for one of the phenotypes get filtered out. Each phenotype (phecode_xx_x.has_phenotype) is a column field that is 1 for samples that have the phecode, 0 if they don't, and missing (I think) if they weren't annotated. When we constructed the cases/controls, we excluded some samples from the control population if they had a confounding phenotype, so the missingness patterns differ across the phecodes.

regression_results1 = hl.logistic_regression_rows(
    test = 'wald',
    y=[burden_mt3.phecode_10_0.has_phenotype, burden_mt3.phecode_110_1.has_phenotype, burden_mt3.phecode_110_11.has_phenotype],
    x=burden_mt3.n_variants,
    covariates=[1.0,
                burden_mt3.sample_data.p22009_a1,
                burden_mt3.sample_data.p22009_a2,
                burden_mt3.sample_data.p22009_a3,
                burden_mt3.sample_data.p22009_a4,
                burden_mt3.sample_data.p22009_a5,
                burden_mt3.sample_data.p22009_a6,
                burden_mt3.sample_data.p22009_a7,
                burden_mt3.sample_data.p22009_a8,
                burden_mt3.sample_data.p22009_a9,
                burden_mt3.sample_data.p22009_a10]
)
2022-12-19 21:23:16 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:23:45 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:24:13 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:24:40 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:25:49 Hail: INFO: Coerced sorted dataset
2022-12-19 21:25:49 Hail: INFO: Coerced dataset with out-of-order partitions.
2022-12-19 21:27:05 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:27:32 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:28:00 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:28:27 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:29:33 Hail: INFO: Coerced sorted dataset
2022-12-19 21:29:33 Hail: INFO: Coerced dataset with out-of-order partitions.
2022-12-19 21:29:44 Hail: WARN: 99919 of 468412 samples have a missing phenotype or covariate.

---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
<ipython-input-49-dc544c769d5c> in <module>
     13                 burden_mt3.sample_data.p22009_a8,
     14                 burden_mt3.sample_data.p22009_a9,
---> 15                 burden_mt3.sample_data.p22009_a10]
     16 )

<decorator-gen-1733> in logistic_regression_rows(test, y, x, covariates, pass_through)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/opt/conda/lib/python3.6/site-packages/hail/methods/statgen.py in logistic_regression_rows(test, y, x, covariates, pass_through)
    865         result = result.transmute(**result.logistic_regression[0])
    866 
--> 867     return result.persist()
    868 
    869 

<decorator-gen-1143> in persist(self, storage_level)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/opt/conda/lib/python3.6/site-packages/hail/table.py in persist(self, storage_level)
   1868             Persisted table.
   1869         """
-> 1870         return Env.backend().persist_table(self, storage_level)
   1871 
   1872     def unpersist(self) -> 'Table':

/opt/conda/lib/python3.6/site-packages/hail/backend/spark_backend.py in persist_table(self, t, storage_level)
    288 
    289     def persist_table(self, t, storage_level):
--> 290         return Table._from_java(self._jbackend.pyPersistTable(storage_level, self._to_java_table_ir(t._tir)))
    291 
    292     def unpersist_table(self, t):

/cluster/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     29                 raise FatalError('%s\n\nJava stack trace:\n%s\n'
     30                                  'Hail version: %s\n'
---> 31                                  'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
     32         except pyspark.sql.utils.CapturedException as e:
     33             raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: HailException: For logistic regression, y at index 1 must be non-constant

Java stack trace:
is.hail.utils.HailException: For logistic regression, y at index 1 must be non-constant
	at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:11)
	at is.hail.utils.package$.fatal(package.scala:78)
	at is.hail.methods.LogisticRegression$$anonfun$execute$1.apply$mcVI$sp(LogisticRegression.scala:45)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
	at is.hail.methods.LogisticRegression.execute(LogisticRegression.scala:40)
	at is.hail.expr.ir.functions.WrappedMatrixToTableFunction.execute(RelationalFunctions.scala:51)
	at is.hail.expr.ir.TableToTableApply.execute(TableIR.scala:2934)
	at is.hail.expr.ir.TableIR.analyzeAndExecute(TableIR.scala:57)
	at is.hail.expr.ir.Interpret$.apply(Interpret.scala:27)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1$$anonfun$apply$9.apply(SparkBackend.scala:502)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1$$anonfun$apply$9.apply(SparkBackend.scala:501)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:47)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:47)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:47)
	at is.hail.backend.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:46)
	at is.hail.utils.package$.using(package.scala:638)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:46)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1.apply(SparkBackend.scala:501)
	at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1.apply(SparkBackend.scala:493)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
	at is.hail.backend.spark.SparkBackend.pyPersistTable(SparkBackend.scala:493)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)



Hail version: 0.2.78-b17627756568
Error summary: HailException: For logistic regression, y at index 1 must be non-constant
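To make that failure mode concrete, here's a toy illustration in plain Python (not Hail; the sample values are made up) of how joint filtering over phenotypes with different missingness patterns can leave one phenotype constant:

```python
# Toy example: two phenotypes with different missingness patterns
# (None = sample excluded from that phenotype's cases/controls).
pheno_a = [1, 0, None, 0, 1, None]
pheno_b = [None, 1, 1, None, 1, 0]

# When phenotypes are analyzed as one group, samples missing in ANY
# phenotype are dropped before the regression runs.
kept = [i for i in range(len(pheno_a))
        if pheno_a[i] is not None and pheno_b[i] is not None]
a_kept = [pheno_a[i] for i in kept]
b_kept = [pheno_b[i] for i in kept]

print(a_kept)  # [0, 1] -> still has both classes
print(b_kept)  # [1, 1] -> constant: the "y must be non-constant" situation
```

Only two samples survive the joint filter, and phenotype b is left with nothing but 1s, which is exactly the kind of constant-y column the error complains about.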

Entering the phenotypes as a list of lists (see code/output below), as you suggested in your post above, just gives me a type error.

regression_results1 = hl.logistic_regression_rows(
    test = 'wald',
    y=[[burden_mt3.phecode_10_0.has_phenotype], [burden_mt3.phecode_110_1.has_phenotype], [burden_mt3.phecode_110_11.has_phenotype]],
    x=burden_mt3.n_variants,
    covariates=[1.0,
                burden_mt3.sample_data.p22009_a1,
                burden_mt3.sample_data.p22009_a2,
                burden_mt3.sample_data.p22009_a3,
                burden_mt3.sample_data.p22009_a4,
                burden_mt3.sample_data.p22009_a5,
                burden_mt3.sample_data.p22009_a6,
                burden_mt3.sample_data.p22009_a7,
                burden_mt3.sample_data.p22009_a8,
                burden_mt3.sample_data.p22009_a9,
                burden_mt3.sample_data.p22009_a10]
)
---------------------------------------------------------------------------
TypecheckFailure                          Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in arg_check(arg, function_name, arg_name, checker)
    583     try:
--> 584         return checker.check(arg, function_name, arg_name)
    585     except TypecheckFailure as e:

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in check(self, x, caller, param)
     84                 pass
---> 85         raise TypecheckFailure()
     86 

TypecheckFailure: 

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-50-8221d851994e> in <module>
     13                 burden_mt3.sample_data.p22009_a8,
     14                 burden_mt3.sample_data.p22009_a9,
---> 15                 burden_mt3.sample_data.p22009_a10]
     16 )

<decorator-gen-1733> in logistic_regression_rows(test, y, x, covariates, pass_through)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    574     @decorator
    575     def wrapper(__original_func, *args, **kwargs):
--> 576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
    577         return __original_func(*args_, **kwargs_)
    578 

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in check_all(f, args, kwargs, checks, is_method)
    541                 raise TypeError(
    542                     f'Expected {n_pos_args} positional arguments, found {len(args)}')
--> 543             args_.append(arg_check(args[i], name, arg_name, checker))
    544         elif param.kind in (param.KEYWORD_ONLY, param.POSITIONAL_OR_KEYWORD):
    545             arg = kwargs.pop(arg_name, param.default)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in arg_check(arg, function_name, arg_name, checker)
    590                             expected=checker.expects(),
    591                             found=checker.format(arg)
--> 592                         )) from e
    593 
    594 

TypeError: logistic_regression_rows: parameter 'y': expected (expression of type float64 or Sequence[expression of type float64]), found list: [[<Float64Expression of type float64>], [<Float64Expression of type float64>], [<Float64Expression of type float64>]]

Similarly, entering the phenotypes as a list of lists with sub-lists of length > 1 (see code/output below) gives the same type error. I wanted to try this in case each sub-list had to have length > 1.

regression_results1 = hl.logistic_regression_rows(
    test = 'wald',
    y=[[burden_mt3.phecode_10_0.has_phenotype, burden_mt3.phecode_110_1.has_phenotype], [burden_mt3.phecode_110_11.has_phenotype, burden_mt3.phecode_275_5.has_phenotype]],
    x=burden_mt3.n_variants,
    covariates=[1.0,
                burden_mt3.sample_data.p22009_a1,
                burden_mt3.sample_data.p22009_a2,
                burden_mt3.sample_data.p22009_a3,
                burden_mt3.sample_data.p22009_a4,
                burden_mt3.sample_data.p22009_a5,
                burden_mt3.sample_data.p22009_a6,
                burden_mt3.sample_data.p22009_a7,
                burden_mt3.sample_data.p22009_a8,
                burden_mt3.sample_data.p22009_a9,
                burden_mt3.sample_data.p22009_a10]
)
---------------------------------------------------------------------------
TypecheckFailure                          Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in arg_check(arg, function_name, arg_name, checker)
    583     try:
--> 584         return checker.check(arg, function_name, arg_name)
    585     except TypecheckFailure as e:

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in check(self, x, caller, param)
     84                 pass
---> 85         raise TypecheckFailure()
     86 

TypecheckFailure: 

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<timed exec> in <module>

<decorator-gen-1733> in logistic_regression_rows(test, y, x, covariates, pass_through)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    574     @decorator
    575     def wrapper(__original_func, *args, **kwargs):
--> 576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
    577         return __original_func(*args_, **kwargs_)
    578 

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in check_all(f, args, kwargs, checks, is_method)
    541                 raise TypeError(
    542                     f'Expected {n_pos_args} positional arguments, found {len(args)}')
--> 543             args_.append(arg_check(args[i], name, arg_name, checker))
    544         elif param.kind in (param.KEYWORD_ONLY, param.POSITIONAL_OR_KEYWORD):
    545             arg = kwargs.pop(arg_name, param.default)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in arg_check(arg, function_name, arg_name, checker)
    590                             expected=checker.expects(),
    591                             found=checker.format(arg)
--> 592                         )) from e
    593 
    594 

TypeError: logistic_regression_rows: parameter 'y': expected (expression of type float64 or Sequence[expression of type float64]), found list: [[<Float64Expression of type float64>, <Float64Expression of type float64>], [<Float64Expression of type float64>, <Float64Expression of type float64>]]
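Judging from the type error, y in this Hail version only accepts a single float64 expression or a flat list of them, so a list of lists won't typecheck. As a stopgap while I sort out the grouping, one option is to loop and run each phecode in its own call, so each regression filters only on that phenotype's missingness. A sketch (it assumes the same burden_mt3 fields as above and needs a running Hail cluster, so I haven't verified it here):

```python
# Sketch: one logistic regression per phecode, so each call filters on
# that phenotype's own missingness pattern only. Serial, but correct.
pcs = [burden_mt3.sample_data[f'p22009_a{i}'] for i in range(1, 11)]

results = {}
for name in ['phecode_10_0', 'phecode_110_1', 'phecode_110_11']:
    results[name] = hl.logistic_regression_rows(
        test='wald',
        y=burden_mt3[name].has_phenotype,
        x=burden_mt3.n_variants,
        covariates=[1.0] + pcs,
    )
```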

Execution appears to take a long time

SEE BELOW FOR UPDATE

I was able to run a logistic regression on a single phenotype (code below for reference), but it took a long time: about 7 minutes. I'm wondering what makes it take so long, and whether there's a way to parallelize it that I'm not taking advantage of. My guess is that Hail already parallelizes under the hood (so that once we figure out how to run the regression on multiple phenotypes at once, it will be faster than running them serially), but I'm not sure.

One guess for why this takes so long is that the burden_mt3 I'm using is just a few commands away from the mt_snps described in this post. Since mt_snps seems to be so complicated an expression that I can't even write it to disk, I wonder whether a lot of the processing needed to materialize burden_mt3 is happening when I run the regression.

So, the next thing I'm going to try is writing intermediate MatrixTables to disk and then (re-)reading them. My hope is that this way, Hail won't have to keep track of and re-run all the accumulated statements each time I do something to materialize the MatrixTable.
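For reference, a minimal sketch of what I mean (the path is hypothetical; as I understand it, checkpoint() is equivalent to write() followed by read_matrix_table(), so downstream operations start from the materialized data instead of replaying the whole pipeline):

```python
# Hypothetical path: write burden_mt3 to disk and read it back, truncating
# the computation graph that Hail would otherwise re-execute each time.
burden_mt3 = burden_mt3.checkpoint('dnax://my-project/burden_mt3.mt', overwrite=True)
```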

%%time 

regression_results3 = hl.logistic_regression_rows(
    test = 'wald',
    y=burden_mt3.phecode_275_5.has_phenotype,
    x=burden_mt3.n_variants,
    covariates=[1.0,
                burden_mt3.sample_data.p22009_a1,
                burden_mt3.sample_data.p22009_a2,
                burden_mt3.sample_data.p22009_a3,
                burden_mt3.sample_data.p22009_a4,
                burden_mt3.sample_data.p22009_a5,
                burden_mt3.sample_data.p22009_a6,
                burden_mt3.sample_data.p22009_a7,
                burden_mt3.sample_data.p22009_a8,
                burden_mt3.sample_data.p22009_a9,
                burden_mt3.sample_data.p22009_a10]
)
CPU times: user 538 ms, sys: 17 ms, total: 555 ms
Wall time: 7min 2s

UPDATE: I started reading/writing intermediate matrixtables, and got a 70x speed-up:

CPU times: user 54 ms, sys: 7.24 ms, total: 61.2 ms
Wall time: 5.67 s

If this isn't emphasized in the Hail documentation (and I don't believe it is), I think it could really help! In retrospect, I was probably spending a long time asking the cluster to run the same code over and over again to materialize basically the same MatrixTable with slight modifications.

Best,
Jeremy