Fixing logreg, lmmreg error when using many sample covariates on Dataproc

We were surprised to find that using more than 9 sample covariates in logistic or linear mixed regression on Google Dataproc would throw an error. We've engaged Google support on a better fix, but in the meantime they've suggested, and we've verified, the workaround of including the properties spark.driver.extraJavaOptions=-Xss4M and spark.executor.extraJavaOptions=-Xss4M in the cluster creation command to increase the Java thread stack size, e.g.:

--properties="spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M"
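For example, a cluster creation command including this workaround might look like the sketch below (the cluster name my-hail-cluster is hypothetical, and any other flags you normally pass, such as machine types or initialization actions, stay the same):

gcloud dataproc clusters create my-hail-cluster \
    --properties="spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M"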

Alternatively, these properties can be passed directly to the Spark submit command, as in the sketch below. See this post for more information on using Hail on Google Cloud.
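A minimal spark-submit sketch, assuming a hypothetical job script your_hail_script.py (any other options you normally pass, such as --jars or --py-files for the Hail libraries, stay the same):

spark-submit \
    --conf spark.driver.extraJavaOptions=-Xss4M \
    --conf spark.executor.extraJavaOptions=-Xss4M \
    your_hail_script.py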

For those interested, the underlying issue relates to the linear solve routine in LAPACK called by Breeze natives, as mentioned in this StackOverflow post.

Hi, thanks for the setup guide.

I am still using Hail 0.1 (version 0d9d9fa).
Using this setting, I succeeded in handling a 1 TB dataset (importing a VCF to a VDS).

Now I need to handle a 3 TB dataset. Do I need to increase the recommended -Xss4M?

We no longer support Hail 0.1. You'll have to try the current value and increase it if you run into problems. We strongly recommend updating to Hail 0.2, which has better performance.
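If -Xss4M turns out to be too small, the same properties can be set with a larger stack size; for example (8M here is purely illustrative, not a tested recommendation):

--properties="spark:spark.driver.extraJavaOptions=-Xss8M,spark:spark.executor.extraJavaOptions=-Xss8M"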