OK, coming back with two questions: one on grouping the phenotypes properly, and another on slow execution speed. (UPDATE: I figured out why execution was so slow: I wasn't writing intermediate matrix tables with `checkpoint()`.)
Missingness pattern issue
I think I am having trouble formatting the phenotypes properly so that I get around this missingness behavior you referred to in the second point of your post above. When I try to run the phenotypes either as a list or as a list of lists, I get errors:
Entering the phenotypes as a single list (see code/output below), I think Hail is treating them as one group and filtering out all samples that are missing for any of the phenotypes, as you said. The error I get (`y must be non-constant`) is probably happening because all of the `1` entries are getting filtered out. Each phenotype (`phecode_xx_x.has_phenotype`) is a column field that is `1` for samples that have the phecode, `0` if they don't, and missing (I think) if they weren't annotated. When we constructed the cases/controls, we excluded some samples from the control population if they had a confounding phenotype, so the missingness patterns will be different for each of the phecodes.
regression_results1 = hl.logistic_regression_rows(
test = 'wald',
y=[burden_mt3.phecode_10_0.has_phenotype, burden_mt3.phecode_110_1.has_phenotype, burden_mt3.phecode_110_11.has_phenotype],
x=burden_mt3.n_variants,
covariates=[1.0,
burden_mt3.sample_data.p22009_a1,
burden_mt3.sample_data.p22009_a2,
burden_mt3.sample_data.p22009_a3,
burden_mt3.sample_data.p22009_a4,
burden_mt3.sample_data.p22009_a5,
burden_mt3.sample_data.p22009_a6,
burden_mt3.sample_data.p22009_a7,
burden_mt3.sample_data.p22009_a8,
burden_mt3.sample_data.p22009_a9,
burden_mt3.sample_data.p22009_a10]
)
2022-12-19 21:23:16 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:23:45 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:24:13 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:24:40 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:25:49 Hail: INFO: Coerced sorted dataset
2022-12-19 21:25:49 Hail: INFO: Coerced dataset with out-of-order partitions.
2022-12-19 21:27:05 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:27:32 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:28:00 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:28:27 Hail: INFO: Ordering unsorted dataset with network shuffle
2022-12-19 21:29:33 Hail: INFO: Coerced sorted dataset
2022-12-19 21:29:33 Hail: INFO: Coerced dataset with out-of-order partitions.
2022-12-19 21:29:44 Hail: WARN: 99919 of 468412 samples have a missing phenotype or covariate.
---------------------------------------------------------------------------
FatalError Traceback (most recent call last)
<ipython-input-49-dc544c769d5c> in <module>
13 burden_mt3.sample_data.p22009_a8,
14 burden_mt3.sample_data.p22009_a9,
---> 15 burden_mt3.sample_data.p22009_a10]
16 )
<decorator-gen-1733> in logistic_regression_rows(test, y, x, covariates, pass_through)
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
575 def wrapper(__original_func, *args, **kwargs):
576 args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577 return __original_func(*args_, **kwargs_)
578
579 return wrapper
/opt/conda/lib/python3.6/site-packages/hail/methods/statgen.py in logistic_regression_rows(test, y, x, covariates, pass_through)
865 result = result.transmute(**result.logistic_regression[0])
866
--> 867 return result.persist()
868
869
<decorator-gen-1143> in persist(self, storage_level)
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
575 def wrapper(__original_func, *args, **kwargs):
576 args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577 return __original_func(*args_, **kwargs_)
578
579 return wrapper
/opt/conda/lib/python3.6/site-packages/hail/table.py in persist(self, storage_level)
1868 Persisted table.
1869 """
-> 1870 return Env.backend().persist_table(self, storage_level)
1871
1872 def unpersist(self) -> 'Table':
/opt/conda/lib/python3.6/site-packages/hail/backend/spark_backend.py in persist_table(self, t, storage_level)
288
289 def persist_table(self, t, storage_level):
--> 290 return Table._from_java(self._jbackend.pyPersistTable(storage_level, self._to_java_table_ir(t._tir)))
291
292 def unpersist_table(self, t):
/cluster/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
29 raise FatalError('%s\n\nJava stack trace:\n%s\n'
30 'Hail version: %s\n'
---> 31 'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
32 except pyspark.sql.utils.CapturedException as e:
33 raise FatalError('%s\n\nJava stack trace:\n%s\n'
FatalError: HailException: For logistic regression, y at index 1 must be non-constant
Java stack trace:
is.hail.utils.HailException: For logistic regression, y at index 1 must be non-constant
at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:11)
at is.hail.utils.package$.fatal(package.scala:78)
at is.hail.methods.LogisticRegression$$anonfun$execute$1.apply$mcVI$sp(LogisticRegression.scala:45)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at is.hail.methods.LogisticRegression.execute(LogisticRegression.scala:40)
at is.hail.expr.ir.functions.WrappedMatrixToTableFunction.execute(RelationalFunctions.scala:51)
at is.hail.expr.ir.TableToTableApply.execute(TableIR.scala:2934)
at is.hail.expr.ir.TableIR.analyzeAndExecute(TableIR.scala:57)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:27)
at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1$$anonfun$apply$9.apply(SparkBackend.scala:502)
at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1$$anonfun$apply$9.apply(SparkBackend.scala:501)
at is.hail.backend.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:47)
at is.hail.backend.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:638)
at is.hail.backend.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:47)
at is.hail.backend.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:46)
at is.hail.utils.package$.using(package.scala:638)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:46)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1.apply(SparkBackend.scala:501)
at is.hail.backend.spark.SparkBackend$$anonfun$pyPersistTable$1.apply(SparkBackend.scala:493)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
at is.hail.backend.spark.SparkBackend.pyPersistTable(SparkBackend.scala:493)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
Hail version: 0.2.78-b17627756568
Error summary: HailException: For logistic regression, y at index 1 must be non-constant
Entering the phenotypes as a list of lists (see code/output below), as you suggested in your post above, just gives me a type error:
regression_results1 = hl.logistic_regression_rows(
test = 'wald',
y=[[burden_mt3.phecode_10_0.has_phenotype], [burden_mt3.phecode_110_1.has_phenotype], [burden_mt3.phecode_110_11.has_phenotype]],
x=burden_mt3.n_variants,
covariates=[1.0,
burden_mt3.sample_data.p22009_a1,
burden_mt3.sample_data.p22009_a2,
burden_mt3.sample_data.p22009_a3,
burden_mt3.sample_data.p22009_a4,
burden_mt3.sample_data.p22009_a5,
burden_mt3.sample_data.p22009_a6,
burden_mt3.sample_data.p22009_a7,
burden_mt3.sample_data.p22009_a8,
burden_mt3.sample_data.p22009_a9,
burden_mt3.sample_data.p22009_a10]
)
---------------------------------------------------------------------------
TypecheckFailure Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in arg_check(arg, function_name, arg_name, checker)
583 try:
--> 584 return checker.check(arg, function_name, arg_name)
585 except TypecheckFailure as e:
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in check(self, x, caller, param)
84 pass
---> 85 raise TypecheckFailure()
86
TypecheckFailure:
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
<ipython-input-50-8221d851994e> in <module>
13 burden_mt3.sample_data.p22009_a8,
14 burden_mt3.sample_data.p22009_a9,
---> 15 burden_mt3.sample_data.p22009_a10]
16 )
<decorator-gen-1733> in logistic_regression_rows(test, y, x, covariates, pass_through)
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
574 @decorator
575 def wrapper(__original_func, *args, **kwargs):
--> 576 args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
577 return __original_func(*args_, **kwargs_)
578
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in check_all(f, args, kwargs, checks, is_method)
541 raise TypeError(
542 f'Expected {n_pos_args} positional arguments, found {len(args)}')
--> 543 args_.append(arg_check(args[i], name, arg_name, checker))
544 elif param.kind in (param.KEYWORD_ONLY, param.POSITIONAL_OR_KEYWORD):
545 arg = kwargs.pop(arg_name, param.default)
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in arg_check(arg, function_name, arg_name, checker)
590 expected=checker.expects(),
591 found=checker.format(arg)
--> 592 )) from e
593
594
TypeError: logistic_regression_rows: parameter 'y': expected (expression of type float64 or Sequence[expression of type float64]), found list: [[<Float64Expression of type float64>], [<Float64Expression of type float64>], [<Float64Expression of type float64>]]
Similarly, entering the phenotypes as a list of lists with sub-lists of length > 1 (see code/output below) gives the same type error. I wanted to try this in case each sub-list had to have length > 1.
regression_results1 = hl.logistic_regression_rows(
test = 'wald',
y=[[burden_mt3.phecode_10_0.has_phenotype, burden_mt3.phecode_110_1.has_phenotype], [burden_mt3.phecode_110_11.has_phenotype, burden_mt3.phecode_275_5.has_phenotype]],
x=burden_mt3.n_variants,
covariates=[1.0,
burden_mt3.sample_data.p22009_a1,
burden_mt3.sample_data.p22009_a2,
burden_mt3.sample_data.p22009_a3,
burden_mt3.sample_data.p22009_a4,
burden_mt3.sample_data.p22009_a5,
burden_mt3.sample_data.p22009_a6,
burden_mt3.sample_data.p22009_a7,
burden_mt3.sample_data.p22009_a8,
burden_mt3.sample_data.p22009_a9,
burden_mt3.sample_data.p22009_a10]
)
---------------------------------------------------------------------------
TypecheckFailure Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in arg_check(arg, function_name, arg_name, checker)
583 try:
--> 584 return checker.check(arg, function_name, arg_name)
585 except TypecheckFailure as e:
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in check(self, x, caller, param)
84 pass
---> 85 raise TypecheckFailure()
86
TypecheckFailure:
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
<timed exec> in <module>
<decorator-gen-1733> in logistic_regression_rows(test, y, x, covariates, pass_through)
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
574 @decorator
575 def wrapper(__original_func, *args, **kwargs):
--> 576 args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
577 return __original_func(*args_, **kwargs_)
578
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in check_all(f, args, kwargs, checks, is_method)
541 raise TypeError(
542 f'Expected {n_pos_args} positional arguments, found {len(args)}')
--> 543 args_.append(arg_check(args[i], name, arg_name, checker))
544 elif param.kind in (param.KEYWORD_ONLY, param.POSITIONAL_OR_KEYWORD):
545 arg = kwargs.pop(arg_name, param.default)
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in arg_check(arg, function_name, arg_name, checker)
590 expected=checker.expects(),
591 found=checker.format(arg)
--> 592 )) from e
593
594
TypeError: logistic_regression_rows: parameter 'y': expected (expression of type float64 or Sequence[expression of type float64]), found list: [[<Float64Expression of type float64>, <Float64Expression of type float64>], [<Float64Expression of type float64>, <Float64Expression of type float64>]]
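In the meantime, the only workaround I can think of is to loop over the phenotypes and call `logistic_regression_rows` once per phenotype, so each call applies only that phenotype's own missingness filter (at the cost of one pass over the data per phecode). A sketch, assuming the field names from my dataset above:

```python
# Fallback sketch: one regression call per phenotype, so missing samples
# are dropped per-phecode rather than across all phecodes at once.
# Field names (phecode_*, p22009_a*) are the ones from my dataset above.
PHECODES = ["phecode_10_0", "phecode_110_1", "phecode_110_11", "phecode_275_5"]
PC_FIELDS = [f"p22009_a{i}" for i in range(1, 11)]  # the 10 genetic PCs

def run_per_phenotype(burden_mt):
    import hail as hl  # lazy import; assumes an initialized Hail session

    results = {}
    for code in PHECODES:
        results[code] = hl.logistic_regression_rows(
            test="wald",
            y=burden_mt[code].has_phenotype,
            x=burden_mt.n_variants,
            covariates=[1.0] + [burden_mt.sample_data[f] for f in PC_FIELDS],
        )
    return results
```

This obviously forfeits whatever batching Hail does for a multi-phenotype `y`, so I'd still like to know the right way to group them.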
Execution appears to take a long time
SEE BELOW FOR UPDATE
I was able to run a logistic regression on a single phenotype (code below for reference), but it took a long time: about 7 minutes. I'm wondering what makes it take so long, and whether there's a way to parallelize this that I'm not taking advantage of. My guess is that Hail already does this under the hood (such that when we figure out how to run the regression on multiple phenotypes, it will go faster than just running them serially), but I'm not sure.
One guess I have for why this is taking so long is that the `burden_mt3` I'm using is just a few commands away from the `mt_snps` described in this post. Since `mt_snps` seems to be such a complicated expression that I can't even write it to disk, I wonder whether a lot of the processing to materialize `burden_mt3` is happening after I run the regression.
So, one thing I’m going to try next is to write intermediate matrixtables to disk and then (re-)read them. My hope is that this way, Hail won’t have to keep track of and run all the statements each time I do something to materialize the matrixtable.
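Concretely, the pattern I plan to use is `MatrixTable.checkpoint`, which (as I understand it) does the write and the re-read in one call, so downstream actions start from the materialized data instead of replaying the whole upstream pipeline. A sketch (the path is hypothetical):

```python
def checkpoint_stage(mt, path):
    # MatrixTable.checkpoint = write + read back in one step, so later
    # actions start from the on-disk data rather than re-running every
    # upstream transformation. `path` here is a hypothetical output path.
    return mt.checkpoint(path, overwrite=True)

# Intended usage, with a real MatrixTable and a real output location:
# burden_mt3 = checkpoint_stage(burden_mt3, "gs://my-bucket/tmp/burden_mt3.mt")
```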
%%time
regression_results3 = hl.logistic_regression_rows(
test = 'wald',
y=burden_mt3.phecode_275_5.has_phenotype,
x=burden_mt3.n_variants,
covariates=[1.0,
burden_mt3.sample_data.p22009_a1,
burden_mt3.sample_data.p22009_a2,
burden_mt3.sample_data.p22009_a3,
burden_mt3.sample_data.p22009_a4,
burden_mt3.sample_data.p22009_a5,
burden_mt3.sample_data.p22009_a6,
burden_mt3.sample_data.p22009_a7,
burden_mt3.sample_data.p22009_a8,
burden_mt3.sample_data.p22009_a9,
burden_mt3.sample_data.p22009_a10]
)
CPU times: user 538 ms, sys: 17 ms, total: 555 ms
Wall time: 7min 2s
UPDATE: I started reading/writing intermediate matrixtables, and got a 70x speed-up:
CPU times: user 54 ms, sys: 7.24 ms, total: 61.2 ms
Wall time: 5.67 s
If this isn't emphasized in the Hail documentation (and I don't believe it is), I think emphasizing it could really help! In retrospect, I was probably spending a long time asking the cluster to run the same code over and over to materialize essentially the same matrix table with slight modifications.
Best,
Jeremy