I have a driver node and a few separate worker nodes. At first glance it looks like everything is working fine: I can see the nodes in the web UI, and I can load and collect a CSV from S3. I also tried spark.parallelize and it runs well across the nodes.

I am able to load MatrixTable from S3:

mt = hl.read_matrix_table(database_location)
[Call(alleles=[0, 1], phased=True),
 Call(alleles=[1, 0], phased=True),
 Call(alleles=[0, 0], phased=True),
 Call(alleles=[0, 0], phased=True),
 Call(alleles=[0, 0], phased=True)]

But when I run, for example, mt.summarize(), I get an error:

Java stack trace:
Hail version: 0.2.104-1940d9e8eaab
Error summary: FileNotFoundException: File /tmp/aggregate_intermediates/-pqQ4bnM3U52mspCinC9ilT4c68fc27-1bde-4fad-926b-31c2fd467d50 does not exist

This file exists in the worker's /tmp directory, but not on the driver machine.

Is there a setting I am missing?


/tmp is probably not network-visible – you could try setting a temp dir in hl.init that points to an S3 bucket (use a retention policy so that you're not paying for temp data indefinitely)
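A minimal sketch of that suggestion; the bucket name and prefix here are hypothetical, and any S3 location readable and writable by both the driver and the workers would do:

```python
import hail as hl

# tmp_dir must be a path visible to the driver AND all workers,
# which a local /tmp is not. 's3://my-hail-tmp/hail-tmp/' is a
# hypothetical bucket/prefix -- substitute your own.
hl.init(tmp_dir='s3://my-hail-tmp/hail-tmp/')
```

Pairing this with an S3 lifecycle rule that expires objects under the prefix after a few days keeps the intermediate files from accumulating cost.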

Thanks @tpoterba for the fast reply!

I created a /tmp folder in HDFS storage, pointed the tmp_dir parameter at it, and now it works.
hl.init(sc=spark, tmp_dir='hdfs://')
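For reference, a slightly fuller form of that call; the explicit HDFS path below is hypothetical (the original post leaves it at the bare scheme), and any directory on HDFS reachable by both the driver and the workers should behave the same:

```python
import hail as hl

# Assumes an existing SparkContext/session named `spark`, as in the
# post above. 'hdfs:///tmp/hail' is a hypothetical path on the
# cluster's default HDFS filesystem.
hl.init(sc=spark, tmp_dir='hdfs:///tmp/hail')
```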