Hello,
I’m hoping I can get some help with running Hail on Databricks. I was able to run through most of the 1000 Genomes HailTutorial notebook but came across some problems, particularly when running linear and logistic regression.
I get the following error message when running the “Linear regression with covariates” step:
vds_gwas = (vds_QCed
.filter_variants_expr('va.qc.AF > 0.05 && va.qc.AF < 0.95')
.annotate_samples_vds(vds_pca, code='sa.pca = vds.pca')
.linreg('sa.pheno.CaffeineConsumption',
covariates=['sa.pca.PC1', 'sa.pca.PC2', 'sa.pca.PC3', 'sa.pheno.isFemale']))
FatalError: NoSuchMethodError: breeze.linalg.DenseVector$.canDotD()Lbreeze/generic/UFunc$UImpl2;
---------------------------------------------------------------------------
FatalError Traceback (most recent call last)
<ipython-input-50-7bdfe8ed2fc8> in <module>()
3 .annotate_samples_vds(vds_pca, code='sa.pca = vds.pca')
4 .linreg('sa.pheno.CaffeineConsumption',
----> 5 covariates=['sa.pca.PC1', 'sa.pca.PC2', 'sa.pca.PC3', 'sa.pheno.isFemale']))
/local_disk0/spark-b6f170e4-17f2-4f52-8dd5-257009d97113/userFiles-a79d2bc9-9c87-4039-b6e0-7f7b2432d5af/addedFile568486992112226566dbfs__FileStore_jars_91d39314_abe4_4e1b_bf17_5efe19d52bb6_hail_devel_py2_7_databricks_20ed0-e92c7.egg/hail/dataset.py in linreg(self, y, covariates, root, minac, minaf)
2111 self._jvds, y, jarray(env.jvm.java.lang.String, covariates), root, minac, minaf))
2112 except Py4JJavaError as e:
-> 2113 raise_py4j_exception(e)
2114
2115 def lmmreg(self, kinship_vds, y, covariates=[], global_root="global.lmmreg", va_root="va.lmmreg",
/local_disk0/spark-b6f170e4-17f2-4f52-8dd5-257009d97113/userFiles-a79d2bc9-9c87-4039-b6e0-7f7b2432d5af/addedFile568486992112226566dbfs__FileStore_jars_91d39314_abe4_4e1b_bf17_5efe19d52bb6_hail_devel_py2_7_databricks_20ed0-e92c7.egg/hail/java.py in raise_py4j_exception(e)
85 def raise_py4j_exception(e):
86 msg = env.jutils.getMinimalMessage(e.java_exception)
---> 87 raise FatalError(msg, e.java_exception)
FatalError: NoSuchMethodError: breeze.linalg.DenseVector$.canDotD()Lbreeze/generic/UFunc$UImpl2;
Is this because the version of Hail provided by the tutorial is older? I also noticed that some of the commands in the tutorial notebook differ from those on the main Hail website. Is there a way to build and upload a new version of the Hail .jar and .egg files?
Thank you,
-Jon
You’re right about the version in the tutorial being older – it’s from early Feb., and I think we’ll wait a few weeks until we make a versioned Hail release before updating that.
This error seems to come from a version mismatch between the Spark/Breeze we compiled for and the version running on the cluster.
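For future readers, a quick sanity check for this kind of mismatch is to compare the Spark version the jar was compiled against with the cluster’s runtime version at major.minor granularity. This is just an illustrative sketch (the helper name and version strings are hypothetical, not part of Hail):

```python
def spark_versions_compatible(built_for, running):
    """Compare the major.minor components of two Spark version strings.

    A mismatch (e.g. a jar built for 2.0.x running on a 2.1.x cluster) can
    surface as a NoSuchMethodError from transitive dependencies like Breeze.
    """
    major_minor = lambda v: tuple(int(part) for part in v.split('.')[:2])
    return major_minor(built_for) == major_minor(running)

# A jar compiled against Spark 2.0.2 on a Spark 2.1.0 cluster:
print(spark_versions_compatible('2.0.2', '2.1.0'))  # prints False
```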
Thank you for the reply. Is there a way that I can build my own egg and jar files from source? Or do you recommend waiting until a more stable version?
I was able to compile a new hail-all-spark.jar using the recommended procedure, but I don’t see any instructions for creating the .egg file. Is it OK to just run python2.7 setup.py bdist_egg?
-Jon
Yeah, that’s exactly right. I have a gradle task for this in a branch, but haven’t merged it.
Hi Tim,
Sorry to keep bothering you about this.
I was able to create a hail-all-spark.jar using the ./gradlew shadowJar command and was able to run through the entire tutorial on my local machine. However, when I tried building the egg file using python2.7 setup.py bdist_egg, the resulting egg file didn’t include any of the top-level scripts, so I had to move the setup.py file up one level to include all of the .py files.
I imported the new .jar and .egg files to Databricks and started running through the tutorial again. I am getting stuck at the first genotype-filtering step, when creating the vds_gAB dataset, where I get a FatalError: ClassNotFoundException: is.hail.variant.Genotype (full error message below).
Have you encountered this error before? Is there something that I am missing when building the package for Databricks?
Thank you,
-Jon
P.S. Regarding my previous problem with Breeze, I realized I was running on a Spark 2.1 cluster. I was able to successfully run the tutorial when I changed to a Spark 2.0 (Auto-updating, Scala 2.11) cluster as listed in the tutorial itself.
Running on a Spark 2.0 (Auto-updating, Scala 2.11) Community Optimized cluster.
# Works fine on original vds
vds.count(genotypes = True)
Out[13]:
{u'callRate': 98.81098285727502,
u'nCalled': 7689777L,
u'nGenotypes': 7782310L,
u'nSamples': 710,
u'nVariants': 10961L}
# Doesn't work on filtered vds
filter_condition_ab = '''let ab = g.ad[1] / g.ad.sum() in
((g.isHomRef() && ab <= 0.1) ||
(g.isHet() && ab >= 0.25 && ab <= 0.75) ||
(g.isHomVar() && ab >= 0.9))'''
vds_gAB = vds.filter_genotypes(filter_condition_ab)
print(vds_gAB.count(genotypes=True))
FatalError: ClassNotFoundException: is.hail.variant.Genotype
---------------------------------------------------------------------------
FatalError Traceback (most recent call last)
<ipython-input-11-a930bb2f54ec> in <module>()
----> 1 vds_gDP.count(genotypes = True)
<decorator-gen-187> in count(self, genotypes)
/local_disk0/spark-be704e0a-0e5d-4f00-9705-b125434c6654/userFiles-7a36ef92-9010-4261-94c1-9c2d9b7c6124/addedFile253781640921013150dbfs__FileStore_jars_55d94108_72dd_4c60_978e_442aecaf16df_hail_devel_py2_7_2fd2a-7a3c0.egg/hail/java.py in handle_py4j(func, *args, **kwargs)
108 except Py4JJavaError as e:
109 msg = Env.jutils().getMinimalMessage(e.java_exception)
--> 110 raise FatalError(msg)
111 except Py4JError as e:
112 Env.jutils().log().error('hail: caught python exception: ' + str(e))
FatalError: ClassNotFoundException: is.hail.variant.Genotype
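As an aside for readers following along: the allele-balance filter above just keeps genotypes whose alt-read fraction ab = g.ad[1] / g.ad.sum() is consistent with the hard call. The same logic can be sketched in plain Python (a hypothetical helper for illustration, not the Hail API):

```python
def keep_genotype(call, ad):
    """Replicate the allele-balance filter.

    call is one of 'hom_ref', 'het', 'hom_var'; ad is the list of
    allelic depths [ref_reads, alt_reads].
    """
    ab = ad[1] / float(sum(ad))    # fraction of reads supporting the alt allele
    if call == 'hom_ref':
        return ab <= 0.1           # hom-ref calls should have few alt reads
    if call == 'het':
        return 0.25 <= ab <= 0.75  # het calls should be roughly balanced
    if call == 'hom_var':
        return ab >= 0.9           # hom-var calls should be mostly alt reads
    return False

print(keep_genotype('het', [10, 12]))  # prints True: ab ~ 0.55 is balanced
```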
Interesting, this is a new one! I can’t think of a reason off the top of my head why the class loader isn’t finding Genotype. The only time we’ve seen ClassNotFoundExceptions is when the jar isn’t visible on the worker nodes. I’ll think on this a bit.
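In case it helps with debugging: when running plain Spark (outside Databricks, which has its own library-attachment mechanism), the usual way to make a jar visible on the workers is via the standard Spark config keys. A rough sketch, with a placeholder jar path:

```python
from pyspark import SparkConf, SparkContext

jar = '/path/to/hail-all-spark.jar'  # placeholder path
conf = (SparkConf()
        .set('spark.jars', jar)                   # ship the jar to the executors
        .set('spark.driver.extraClassPath', jar)  # put it on the driver classpath
        # shipped jars land in each executor's working directory:
        .set('spark.executor.extraClassPath', './hail-all-spark.jar'))
sc = SparkContext(conf=conf)
```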