Write Bgen Fatal Error

This error (lightly reformatted for legibility):

org.apache.spark.SparkException:
 Job aborted due to stage failure:
   Task 300 in stage 1.0 failed 4 times, most recent failure:
     Lost task 300.3 in stage 1.0 (TID 2322) (all-of-us-6636-sw-1zlp.c.terra-vpc-sc-e41b7259.internal executor 370):
       org.apache.hadoop.ipc.RemoteException(java.io.IOException):
         File /tmp/export-bgen-concatenated-LpLH7K29SkI8krTADdGvj7/part-0300-ee8d1e84-ecdb-45c0-a499-abfe13ac0704
         could only be written to 0 of the 1 minReplication nodes. There are 2 
         datanode(s) running and 2 node(s) are excluded in this operation.

This error indicates that you are writing to HDFS (HadoopFileSystem). HDFS is a networked filesystem that is serviced only by the non-preemptible / non-spot instances in your cluster. When your cluster primarily consists of preemptible / spot instances (as it should: they are roughly one-fifth the cost, and Hail is designed to use them), HDFS becomes effectively unusable. You have two choices:

  1. Use more non-preemptible / non-spot instances (at 5x the cost) to make HDFS more reliable.
  2. Don’t use HDFS.

I strongly recommend the latter. S3 / GCS / ABFS are more reliable, more scalable, and transcend the lifetime of a cluster. In particular, it looks like your temporary directory is set to /tmp. A very confusing aspect of Spark and Hadoop clusters is that a path like /foo refers to /foo in HDFS, not on the local disk; if you want the local filesystem of an instance, you have to write file:/foo. The simple fix here is this:

import hail as hl
hl.init(tmp_dir='gs://my-7-day-bucket/')

where my-7-day-bucket is a bucket with an “Object Lifecycle Management” policy. For example, Hail uses this 7-day lifecycle policy:

  "lifecycle": {
    "rule": [
      {
        "action": {
          "type": "Delete"
        },
        "condition": {
          "age": 7
        }
      }
    ]
  }

This empowers you to use this bucket for temporary storage knowing that whatever data is stored there will be deleted after seven days. This helps control cost. Seven days might even be too long! If all of your pipelines complete within a single day, you can use a shorter age. I recommend giving yourself a bit of buffer though!
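
If it helps, you can also attach a policy like this programmatically. Here is a sketch using the google-cloud-storage client (the bucket name is a placeholder, and you need permission to update the bucket); gsutil lifecycle set works just as well:

  from google.cloud import storage

  client = storage.Client()
  bucket = client.get_bucket('my-7-day-bucket')  # placeholder bucket name
  # Add a rule deleting any object older than 7 days, then save it to the bucket.
  bucket.add_lifecycle_delete_rule(age=7)
  bucket.patch()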


The parallel option to export_bgen avoids HDFS entirely because it writes only to the final destination. In non-parallel mode, Hail writes each partition to a temporary location and then reads everything back to concatenate it into a single file. Concatenation is rather slow anyway (it is necessarily a serial, single-machine operation). If you can get by with sharded BGENs, I strongly recommend doing that.
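
For example, something like this should work (a sketch: the paths are placeholders, the MatrixTable is assumed to have a GP entry field, and it is worth double-checking the accepted values of parallel in the export_bgen documentation):

  import hail as hl

  mt = hl.read_matrix_table('gs://my-bucket/dataset.mt')  # placeholder path
  # One BGEN shard per partition, written directly to GCS: no HDFS, no concatenation step.
  hl.export_bgen(mt, 'gs://my-bucket/exports/dataset', parallel='header_per_shard')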