Exporting a 20M variant x 400K sample MatrixTable to (ideally) BGEN format

We’ve been loving Hail for wrangling the UK Biobank data, but are now at a stage of needing to export from Hail to BGEN format so we can use some other association analysis tools.

With various filters applied, we have a single MatrixTable of 20M variants and 400K individuals (18K shards), and we’ve been attempting to use export_bgen on AWS. Once we accepted that writing a single BGEN file was beyond our memory capacity, we moved on to a repartition to 500 partitions (a size chosen for downstream analysis) followed by export_bgen with the parallel flag, specifically:

    # Reduce from ~18K partitions to 500 (full shuffle)
    mt = mt.repartition(
        n_partitions=500,
        shuffle=True
    )
    # Write one BGEN file (each with its own header) per partition
    hl.export_bgen(
        mt,
        'output_path',
        parallel='header_per_shard'
    )

Here are the flags we used to run this on an AWS EMR cluster with 100 core nodes and 400 task nodes, all r4.xlarge instances:

# Give the driver ~80% of the machine's RAM, in whole GB (MemTotal is reported in kB)
readonly DRIVER_MEM=$(cat /proc/meminfo | \
  awk '$1 == "MemTotal:" {print int($2 / 2**20 * 0.8)}')

spark-submit \
  --jars ${HAIL_HOME}/hail-all-spark.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator \
  --conf spark.driver.memory=${DRIVER_MEM}g \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.task.maxFailures=5 \
  --conf spark.kryoserializer.buffer.max=1g \
  --conf spark.driver.extraJavaOptions=-Xss4M \
  --conf spark.executor.extraJavaOptions=-Xss4M \
  --conf spark.network.timeout=10000000s \
  --conf spark.executor.heartbeatInterval=1000000s \
  --conf spark.executor.memory=21g \
  --conf spark.executor.memoryOverhead=1g \
  --conf spark.executor.cores=4 \
  --conf spark.driver.cores=4 \
  --master yarn \

This failed after the repartitioning step, with 500 of the 18K tasks to go; I’d guess it was in the export step. The full log is 17 MB, so I can send it if needed, but I think this is probably the relevant error: java.io.IOException: No space left on device.

I can think of three options from here:

  1. Tweaking configuration or cluster composition to work with existing commands
  2. Running the parallel export on all 18K shards and then using something like cat-bgen (distributed alongside bgenix in the bgen package) to concatenate down to the 500 files of interest (rough sketch below)
  3. Using export_gen and converting to BGEN with an external tool such as qctool
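
For option 2, here’s a rough sketch of what I have in mind (the output path and shard filenames below are placeholders; cat-bgen is the concatenation tool distributed with the bgen library alongside bgenix):

    # Export one BGEN file per existing partition (no repartition, so no big
    # shuffle), then concatenate groups of shards down to ~500 files outside
    # Hail with something like:
    #   cat-bgen -g shard-00000.bgen shard-00001.bgen ... -og merged-000.bgen
    hl.export_bgen(
        mt,
        'output_path_shards',
        parallel='header_per_shard'
    )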

Thanks for your help!


Hi @ERu,

This error means the local disks on your worker nodes are full. A shuffle writes its intermediate data to local disk, so you’ll need free space on the order of the size of your input data, and possibly more, since the intermediate data isn’t stored and compressed together and is therefore larger.

Definitely don’t use export_gen: we’ve carefully optimized the BGEN import/export code, but we haven’t done the same for the GEN code.

Moreover, if you’re running on flaky nodes (Google preemptibles or Amazon spot instances), you might benefit from writing a temporary file and reading it back before the repartition. For the repartition step itself, you should use reliable nodes (e.g. non-preemptible / on-demand instances).
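
Concretely, something like this before the repartition (the temporary path is just a placeholder; put it somewhere with plenty of space, like S3 or HDFS):

    # Materialize the filtered MatrixTable once, then read it back so the
    # repartition/export starts from an already-written dataset rather than
    # recomputing the whole upstream pipeline on flaky nodes.
    mt.write('s3://my-bucket/tmp/filtered.mt', overwrite=True)
    mt = hl.read_matrix_table('s3://my-bucket/tmp/filtered.mt')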

Since you’re reducing the number of partitions, I would also try repartition(shuffle=False) rather than shuffle=True.
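
That is, something like this, keeping the rest of your pipeline the same:

    # shuffle=False coalesces existing partitions instead of performing a full
    # shuffle, so the workers don't have to spill large shuffle files to their
    # local disks.
    mt = mt.repartition(n_partitions=500, shuffle=False)
    hl.export_bgen(mt, 'output_path', parallel='header_per_shard')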