Exporting a 20M variant x 400K sample MatrixTable to (ideally) BGEN format

We’ve been loving Hail for wrangling the UK Biobank data, but are now at a stage of needing to export from Hail to BGEN format so we can use some other association analysis tools.

With various filters applied, we have a single MatrixTable of 20M variants and 400K individuals (18K shards), and we’ve been attempting to use export_bgen on AWS. Once we accepted that writing a single BGEN file was beyond our memory capacity, we moved on to a repartition to 500 partitions (a size chosen for downstream analysis) followed by export_bgen with the parallel flag, specifically:

    # Reduce from ~18K partitions to 500 (full shuffle)
    mt = mt.repartition(
        n_partitions=500,
        shuffle=True
    )
    # Write one BGEN file (each with its own header) per partition
    hl.export_bgen(
        mt,
        'output_path',
        parallel='header_per_shard'
    )

Here are the flags we used to run this on an AWS EMR cluster with 100 core nodes and 400 task nodes, all r4.xlarge instances:

# Give the driver ~80% of the machine's RAM, in whole GB (MemTotal is reported in kB)
readonly DRIVER_MEM=$(cat /proc/meminfo | \
  awk '$1 == "MemTotal:" {print int($2 / 2**20 * 0.8)}')

spark-submit \
  --jars ${HAIL_HOME}/hail-all-spark.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator \
  --conf spark.driver.memory=${DRIVER_MEM}g \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.task.maxFailures=5 \
  --conf spark.kryoserializer.buffer.max=1g \
  --conf spark.driver.extraJavaOptions=-Xss4M \
  --conf spark.executor.extraJavaOptions=-Xss4M \
  --conf spark.network.timeout=10000000s \
  --conf spark.executor.heartbeatInterval=1000000s \
  --conf spark.executor.memory=21g \
  --conf spark.executor.memoryOverhead=1g \
  --conf spark.executor.cores=4 \
  --conf spark.driver.cores=4 \
  --master yarn \

This failed after the repartitioning step, with 500 of the 18K tasks to go; I’d guess it was in the export step. The full log is 17 MB, so I can send it if needed, but I think this is probably the relevant error: java.io.IOException: No space left on device.

I can think of three options from here:

  1. Tweaking configuration or cluster composition to work with existing commands
  2. Running the parallel export on all 18K shards and then using something like cat-bgen (distributed alongside bgenix in the bgen package) to concatenate down to the 500 files of interest (rough sketch below)
  3. Using export_gen and converting to BGEN with an external tool such as qctool
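
For option 2, here’s a rough sketch of what I have in mind (the output path and shard filenames below are placeholders; cat-bgen is the concatenation tool distributed with the bgen library alongside bgenix):

    # Export one BGEN file per existing partition (no repartition, so no big
    # shuffle), then concatenate groups of shards down to ~500 files outside
    # Hail with something like:
    #   cat-bgen -g shard-00000.bgen shard-00001.bgen ... -og merged-000.bgen
    hl.export_bgen(
        mt,
        'output_path_shards',
        parallel='header_per_shard'
    )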

Thanks for your help!


Hi @ERu,

This error means the local disks on your worker nodes are full. A shuffle writes its intermediate data to local disk, so you’ll need free space on the order of the size of your input data, and possibly more, since the intermediate data isn’t stored and compressed together and is therefore larger.

Definitely don’t use export_gen: we’ve carefully optimized the BGEN import/export code, but we haven’t done the same for the GEN code.

Moreover, if you’re running on flaky nodes (Google preemptibles or Amazon spot instances), you might benefit from writing a temporary file and reading it back before the repartition. For the repartition step itself, you should use reliable nodes (e.g. non-preemptible / on-demand instances).
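
Concretely, something like this before the repartition (the temporary path is just a placeholder; put it somewhere with plenty of space, like S3 or HDFS):

    # Materialize the filtered MatrixTable once, then read it back so the
    # repartition/export starts from an already-written dataset rather than
    # recomputing the whole upstream pipeline on flaky nodes.
    mt.write('s3://my-bucket/tmp/filtered.mt', overwrite=True)
    mt = hl.read_matrix_table('s3://my-bucket/tmp/filtered.mt')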

Since you’re reducing the number of partitions, I would also try repartition(shuffle=False) rather than shuffle=True.
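
That is, something like this, keeping the rest of your pipeline the same:

    # shuffle=False coalesces existing partitions instead of performing a full
    # shuffle, so the workers don't have to spill large shuffle files to their
    # local disks.
    mt = mt.repartition(n_partitions=500, shuffle=False)
    hl.export_bgen(mt, 'output_path', parallel='header_per_shard')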