We were surprised to find that using more than 9 sample covariates in logistic or linear mixed regression on Google Dataproc throws an error. We’ve engaged Google support on a better fix, but in the meantime they’ve suggested, and we’ve verified, a workaround: increase the Java stack size by including the properties spark.driver.extraJavaOptions=-Xss4M and spark.executor.extraJavaOptions=-Xss4M in the cluster creation command, e.g.:
--properties="spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M"
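For context, here is a rough sketch of where that flag sits in a full cluster creation command. The cluster name is a hypothetical placeholder, and any other options your cluster needs (machine types, initialization actions, and so on) are omitted. The snippet only assembles and prints the command so you can inspect it before running it yourself.

```shell
# Hypothetical cluster name (an assumption, not from the original post).
CLUSTER_NAME="my-hail-cluster"

# Raise the JVM thread stack size to 4 MB on both the driver and the executors.
STACK_OPTS="-Xss4M"

# The "spark:" prefix tells Dataproc to treat these as Spark properties.
PROPERTIES="spark:spark.driver.extraJavaOptions=${STACK_OPTS},spark:spark.executor.extraJavaOptions=${STACK_OPTS}"

# Assemble the command as a string and print it for review.
CREATE_CMD="gcloud dataproc clusters create ${CLUSTER_NAME} --properties=${PROPERTIES}"
echo "${CREATE_CMD}"
```

Copy the printed command, add whatever other flags your setup requires, and run it to create the cluster.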
Alternatively, these can be included in the Spark submit command. See this post for more information on using Hail on Google Cloud.
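As a minimal sketch of the submit-time variant, the same two properties can be passed with spark-submit's --conf option. Note that the spark: prefix is dropped here, since it is only used for Dataproc cluster properties. The script name is a hypothetical placeholder, and the jars and other settings that Hail itself needs are left out for brevity. As above, the snippet prints the command rather than running it.

```shell
# Hypothetical script name (an assumption, not from the original post).
SCRIPT="my_hail_script.py"

# Same 4 MB stack size as in the cluster-level workaround.
STACK_SIZE="4M"

# Plain Spark property names: no "spark:" prefix at submit time.
DRIVER_CONF="spark.driver.extraJavaOptions=-Xss${STACK_SIZE}"
EXECUTOR_CONF="spark.executor.extraJavaOptions=-Xss${STACK_SIZE}"

# Assemble the command as a string and print it for review.
SUBMIT_CMD="spark-submit --conf ${DRIVER_CONF} --conf ${EXECUTOR_CONF} ${SCRIPT}"
echo "${SUBMIT_CMD}"
```

Setting the properties at submit time affects only that job, which can be handy if you cannot recreate the cluster.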
For those interested, the underlying issue relates to the linear solve routine in LAPACK called by Breeze natives, as mentioned in this StackOverflow post.