HailException crash during export step - how to diagnose

We are trialing a new on-prem Hail cluster. We have an old on-prem cluster which doesn't exhibit the problem, and we thought we had replicated the provisioning, but we clearly missed something:

We are finding that writing an mt works fine, but exporting is failing, and we can't work out what to look at / tweak to begin diagnosing the problem - any advice appreciated.

Note: This problem happens regardless of whether we're trying to export Tables or MatrixTables.

Details below.

# import

mt = hl.import_vcf("s3a://ddd-elgh/chr21.test.vcf.gz", force_bgz=True)
2019-05-24 15:15:10 Hail: WARN: expected input file `s3a://ddd-elgh/chr21.test.vcf.gz' to end in .vcf[.bgz, .gz]

# write mt works fine

mt.write("s3a://ddd-elgh/test.mt", overwrite=True)
2019-05-24 15:15:14 Hail: INFO: Coerced sorted dataset
2019-05-24 15:15:25 Hail: INFO: wrote matrix table with 547 rows and 12644 columns in 2 partitions to s3a://ddd-elgh/test.mt

# export to vcf blows up
hl.export_vcf(mt, "s3a://ddd-elgh/test.vcf")


FatalError Traceback (most recent call last)
in
----> 1 hl.export_vcf(mt, "s3a://ddd-elgh/test.vcf")

FatalError: HailException: Expected 2 part files but found 0

Java stack trace:
is.hail.utils.HailException: Expected 2 part files but found 0
at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:9)
at is.hail.utils.package$.fatal(package.scala:28)
at is.hail.utils.richUtils.RichHadoopConfiguration$.copyMerge$extension(RichHadoopConfiguration.scala:178)
at is.hail.utils.richUtils.RichRDD$.writeTable$extension(RichRDD.scala:84)
at is.hail.io.vcf.ExportVCF$.apply(ExportVCF.scala:466)
at is.hail.expr.ir.MatrixVCFWriter.apply(MatrixWriter.scala:37)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:751)
at is.hail.expr.ir.Interpret$.apply(Interpret.scala:87)
at is.hail.expr.ir.CompileAndEvaluate$.apply(CompileAndEvaluate.scala:31)
at is.hail.backend.spark.SparkBackend$.execute(SparkBackend.scala:49)
at is.hail.backend.spark.SparkBackend$.executeJSON(SparkBackend.scala:16)
at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

This likely has to do with the way the cluster is configured. Are you running on EMR? Does it have HDFS running?

The difference between write and export_vcf (with parallel=None) is that write writes its files directly to the destination directory, while export_vcf first writes parallel shards to the temporary directory tmp_dir specified in hl.init. That directory must be on a network-visible file system: files written on the worker machines must be visible to the driver machine. tmp_dir defaults to /tmp, and without a full URI that path resolves against the default file scheme, which is usually HDFS (a network-visible file system) if HDFS is running.
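For example, a network-visible tmp_dir can be set at initialization (a sketch; the HDFS path is a placeholder):

import hail as hl

# Hypothetical: point tmp_dir at a location every node can see (an HDFS path
# or a shared mount) rather than the node-local /tmp.
hl.init(tmp_dir='hdfs:///tmp/hail')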

If tmp_dir is not network-visible (say, file:///tmp), this error appears when the driver tries to concatenate the shards into one file and can't find the pieces it expects to be there.
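If a network-visible tmp_dir isn't convenient, one workaround (a sketch; the output path is a placeholder) is to export with the parallel option, which leaves the output as a directory of shards and skips the driver-side concatenation:

# Writes a directory of VCF shards plus a separate header file instead of one
# merged file, so no shards need to pass through tmp_dir.
hl.export_vcf(mt, 's3a://ddd-elgh/test-parallel.vcf.bgz', parallel='separate_header')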

Is HDFS better/mandatory, or is any network filesystem OK? I.e., does the choice of a specific network filesystem make any difference?

Different network file systems have different properties, but any network file system should be fine (e.g. Lustre works fine). I think object stores like GS or S3 sometimes don't work in this case, though, since they're not really network file systems, just pretending.
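One rough way to sanity-check that a shared tmp_dir mount really is visible to the workers (a sketch; the path and partition count are placeholders) is to have each Spark task write a marker file into it and then look for those files from the driver:

import os
import hail as hl

SHARED_TMP = '/shared/hail-tmp'  # hypothetical shared mount, also used as tmp_dir

hl.init(tmp_dir=SHARED_TMP)
sc = hl.spark_context()

def touch(i):
    # each task runs on a worker and writes one marker file into the shared mount
    with open(os.path.join(SHARED_TMP, 'marker_%d' % i), 'w') as f:
        f.write('ok')
    return i

sc.parallelize(range(4), 4).map(touch).collect()

# if the mount is genuinely shared, the driver sees every marker the workers wrote
print(sorted(f for f in os.listdir(SHARED_TMP) if f.startswith('marker_')))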

Thank you. We had set up a shared tmp_dir, but on careful inspection it was not correctly visible to the slaves: once we fixed that, the exports ran fine.