Running Hail on Databricks

Hello,

I’m hoping I can get some help with running Hail on Databricks. I was able to run through most of the 1000 Genomes HailTutorial notebook, but came across some problems, particularly when running linear and logistic regression.

I get the following error message when running the “Linear regression with covariates” step.


vds_gwas = (vds_QCed
    .filter_variants_expr('va.qc.AF > 0.05 && va.qc.AF < 0.95')
    .annotate_samples_vds(vds_pca, code='sa.pca = vds.pca')
    .linreg('sa.pheno.CaffeineConsumption',
        covariates=['sa.pca.PC1', 'sa.pca.PC2', 'sa.pca.PC3', 'sa.pheno.isFemale']))

FatalError: NoSuchMethodError: breeze.linalg.DenseVector$.canDotD()Lbreeze/generic/UFunc$UImpl2;
---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
<ipython-input-50-7bdfe8ed2fc8> in <module>()
      3     .annotate_samples_vds(vds_pca, code='sa.pca = vds.pca')
      4     .linreg('sa.pheno.CaffeineConsumption',
----> 5             covariates=['sa.pca.PC1', 'sa.pca.PC2', 'sa.pca.PC3', 'sa.pheno.isFemale']))

/local_disk0/spark-b6f170e4-17f2-4f52-8dd5-257009d97113/userFiles-a79d2bc9-9c87-4039-b6e0-7f7b2432d5af/addedFile568486992112226566dbfs__FileStore_jars_91d39314_abe4_4e1b_bf17_5efe19d52bb6_hail_devel_py2_7_databricks_20ed0-e92c7.egg/hail/dataset.py in linreg(self, y, covariates, root, minac, minaf)
   2111                     self._jvds, y, jarray(env.jvm.java.lang.String, covariates), root, minac, minaf))
   2112         except Py4JJavaError as e:
-> 2113             raise_py4j_exception(e)
   2114 
   2115     def lmmreg(self, kinship_vds, y, covariates=[], global_root="global.lmmreg", va_root="va.lmmreg",

/local_disk0/spark-b6f170e4-17f2-4f52-8dd5-257009d97113/userFiles-a79d2bc9-9c87-4039-b6e0-7f7b2432d5af/addedFile568486992112226566dbfs__FileStore_jars_91d39314_abe4_4e1b_bf17_5efe19d52bb6_hail_devel_py2_7_databricks_20ed0-e92c7.egg/hail/java.py in raise_py4j_exception(e)
     85 def raise_py4j_exception(e):
     86     msg = env.jutils.getMinimalMessage(e.java_exception)
---> 87     raise FatalError(msg, e.java_exception)

FatalError: NoSuchMethodError: breeze.linalg.DenseVector$.canDotD()Lbreeze/generic/UFunc$UImpl2;

Is this because the version of Hail provided by the tutorial is older? I also noticed that some of the commands in the tutorial notebook are different from the commands on the main Hail website. Is there a way to build and upload new versions of the Hail .jar and .egg files?

Thank you,
-Jon

You’re right about the version in the tutorial being older – it’s from early Feb., and I think we’ll wait a few weeks until we make a versioned Hail release before updating that.

This error seems to come from a version mismatch between the Spark/Breeze version Hail was compiled against and the version running on the cluster.
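
A quick sanity check (a minimal sketch; sc is the SparkContext that Databricks notebooks provide) is to print the cluster’s Spark version from a notebook cell and compare it against the Spark version the Hail jar was built for:

# `sc` is the SparkContext Databricks injects into notebooks.
# The Hail jar should be built against a matching Spark (and hence Breeze) version.
print(sc.version)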

Thank you for the reply. Is there a way that I can build my own egg and jar files from source? Or do you recommend waiting for a more stable version?

I was able to compile a new hail-all-spark.jar using the recommended procedure, but I don’t see any instructions for creating the .egg file. Is it OK to just run python2.7 setup.py bdist_egg?

-Jon

Yeah, that’s exactly right. I have a Gradle task for this in a branch, but haven’t merged it yet.
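
For anyone following along, the full build sequence assembled from the commands in this thread looks roughly like this (a sketch; the jar output path is the usual shadowJar location and may vary between versions):

# From the root of a Hail checkout:
./gradlew shadowJar               # produces build/libs/hail-all-spark.jar
python2.7 setup.py bdist_egg      # produces the .egg under dist/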

Hi Tim,

Sorry to keep bothering you about this.

I was able to create a hail-all-spark.jar using the ./gradlew shadowJar command and was able to run through the entire tutorial on my local machine. However, when I tried building the egg file using python2.7 setup.py bdist_egg, the resulting egg file didn’t include any of the top-level scripts, so I had to move the setup.py file up one level to include all of the .py files.
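
In case it helps anyone else: an .egg is just a zip archive, so you can sanity-check what made it in (the exact egg filename will differ):

unzip -l dist/hail-*.egg          # the top-level hail/*.py modules should be listed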

I imported the new .jar and .egg files into Databricks and started running through the tutorial again. I am getting stuck at the first genotype-filtering step, when creating the vds_gAB dataset, where I get FatalError: ClassNotFoundException: is.hail.variant.Genotype (full error message below).

Have you encountered this error before? Is there something that I am missing when building the package for Databricks?

Thank you,
-Jon

P.S. Regarding my previous problem with Breeze, I realized I was running on a Spark 2.1 cluster. I was able to successfully run the tutorial when I changed to a Spark 2.0 (Auto-updating, Scala 2.11) cluster as listed in the tutorial itself.


Running on a Spark 2.0 (Auto-updating, Scala 2.11) Community Optimized cluster.

# Works fine on original vds
vds.count(genotypes=True)

Out[13]: 
{u'callRate': 98.81098285727502,
 u'nCalled': 7689777L,
 u'nGenotypes': 7782310L,
 u'nSamples': 710,
 u'nVariants': 10961L}

# Doesn't work on filtered vds
filter_condition_ab = '''let ab = g.ad[1] / g.ad.sum() in
   ((g.isHomRef() && ab <= 0.1) ||
   (g.isHet() && ab >= 0.25 && ab <= 0.75) ||
   (g.isHomVar() && ab >= 0.9))'''

vds_gAB = vds.filter_genotypes(filter_condition_ab)

print(vds_gAB.count(genotypes=True))

FatalError: ClassNotFoundException: is.hail.variant.Genotype
---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
<ipython-input-11-a930bb2f54ec> in <module>()
----> 1 vds_gDP.count(genotypes = True)

<decorator-gen-187> in count(self, genotypes)

/local_disk0/spark-be704e0a-0e5d-4f00-9705-b125434c6654/userFiles-7a36ef92-9010-4261-94c1-9c2d9b7c6124/addedFile253781640921013150dbfs__FileStore_jars_55d94108_72dd_4c60_978e_442aecaf16df_hail_devel_py2_7_2fd2a-7a3c0.egg/hail/java.py in handle_py4j(func, *args, **kwargs)
    108     except Py4JJavaError as e:
    109         msg = Env.jutils().getMinimalMessage(e.java_exception)
--> 110         raise FatalError(msg)
    111     except Py4JError as e:
    112         Env.jutils().log().error('hail: caught python exception: ' + str(e))

FatalError: ClassNotFoundException: is.hail.variant.Genotype

Interesting, this is a new one! I can’t think of a reason off the top of my head why the class loader isn’t finding Genotype. The only time we’ve seen ClassNotFoundExceptions is when the jar isn’t visible on the worker nodes. I’ll think on this a bit.
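
In the meantime, one thing worth trying (a sketch, not a verified fix: these are standard Spark properties, and the jar path below is a placeholder for wherever your uploaded jar actually lives) is to add the jar to both the driver and executor classpaths in the cluster’s Spark config:

# Hypothetical Databricks cluster Spark config; replace the path with the
# actual location of the uploaded hail-all-spark.jar.
spark.driver.extraClassPath /path/to/hail-all-spark.jar
spark.executor.extraClassPath /path/to/hail-all-spark.jar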