Fixing logreg, lmmreg error when using many sample covariates on Dataproc

We were surprised to find that using more than 9 sample covariates in logistic or linear mixed regression on Google Dataproc would throw an error. We've engaged Google support on a better fix, but in the meantime they've suggested, and we've verified, the workaround of including the properties spark.driver.extraJavaOptions=-Xss4M and spark.executor.extraJavaOptions=-Xss4M in the cluster creation command to increase the Java thread stack size, e.g.:

--properties="spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M"
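For example, a cluster creation command including this workaround might look like the sketch below (the cluster name my-hail-cluster is hypothetical, and any other flags you normally pass, such as machine types or initialization actions, stay the same):

gcloud dataproc clusters create my-hail-cluster \
    --properties="spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M"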

Alternatively, these properties can be passed directly to the Spark submit command, as in the sketch below. See this post for more information on using Hail on Google Cloud.
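A minimal spark-submit sketch, assuming a hypothetical job script your_hail_script.py (any other options you normally pass, such as --jars or --py-files for the Hail libraries, stay the same):

spark-submit \
    --conf spark.driver.extraJavaOptions=-Xss4M \
    --conf spark.executor.extraJavaOptions=-Xss4M \
    your_hail_script.py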

For those interested, the underlying issue relates to the linear solve routine in LAPACK called by Breeze natives, as mentioned in this StackOverflow post.

Hi, thanks for the setup guide.

I am still using Hail 0.1 (version 0d9d9fa).
Using this setting, I succeeded in handling a 1 TB dataset (importing a VCF to a VDS).

Now I need to handle a 3 TB dataset. Do I need to increase the recommended -Xss4M?

We no longer support Hail 0.1. You'll have to try the current value and increase it if you run into problems. We strongly recommend updating to Hail 0.2, which has better performance.
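If -Xss4M turns out to be too small, the same properties can be set with a larger stack size; for example (8M here is purely illustrative, not a tested recommendation):

--properties="spark:spark.driver.extraJavaOptions=-Xss8M,spark:spark.executor.extraJavaOptions=-Xss8M"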