We’ve been loving Hail for wrangling the UK Biobank data, but we’re now at the stage of needing to export from Hail to BGEN format so we can use some other association-analysis tools.

With various filters applied, we have a single MatrixTable of 20M variants and 400K individuals (18K shards), and we’ve been attempting to use `export_bgen` on AWS. Once we accepted that a single BGEN file was well beyond our memory capabilities, we moved on to trying a repartition to 500 partitions (a count chosen for downstream analysis) followed by `export_bgen` with the `parallel` flag, specifically:
```python
mt = mt.repartition(
    n_partitions=500,
    shuffle=True
)
hl.export_bgen(
    mt,
    'output_path',
    parallel='header_per_shard'
)
```
Here are the flags we used to run this on an AWS EMR cluster with 100 core and 400 task nodes, all `r4.xlarge` instance types:
```shell
readonly DRIVER_MEM=$(cat /proc/meminfo | \
  awk '$1 == "MemTotal:" {print int($2 / 2**20 * 0.8)}')

spark-submit \
  --jars ${HAIL_HOME}/hail-all-spark.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=is.hail.kryo.HailKryoRegistrator \
  --conf spark.driver.memory=${DRIVER_MEM}g \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.task.maxFailures=5 \
  --conf spark.kryoserializer.buffer.max=1g \
  --conf spark.driver.extraJavaOptions=-Xss4M \
  --conf spark.executor.extraJavaOptions=-Xss4M \
  --conf spark.network.timeout=10000000s \
  --conf spark.executor.heartbeatInterval=1000000s \
  --conf spark.executor.memory=21g \
  --conf spark.executor.memoryOverhead=1g \
  --conf spark.executor.cores=4 \
  --conf spark.driver.cores=4 \
  --master yarn \
```
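For reference, the `DRIVER_MEM` line computes 80% of the host's `MemTotal` (reported in kB) in whole GiB; on an `r4.xlarge` with about 30.5 GiB (value assumed, not measured from our nodes), the same arithmetic in Python gives:

```python
# MemTotal in /proc/meminfo is in kB; the awk expression converts to GiB,
# takes 80%, and truncates to a whole number of gigabytes.
mem_total_kb = int(30.5 * 2**20)  # ~30.5 GiB, assumed for an r4.xlarge
driver_mem_gb = int(mem_total_kb / 2**20 * 0.8)
print(driver_mem_gb)  # 24
```

i.e. the driver JVM gets `--conf spark.driver.memory=24g` on that instance type.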
This failed after the repartitioning step, with 500 of the 18K tasks to go; I’d guess it was during the export step. The full log is 17 MB, so I can send it if needed, but I think this is probably the relevant error: `java.io.IOException: No space left on device`.
I can think of three options from here:

- Tweaking the configuration or cluster composition to make the existing commands work
- Running the parallel export on all 18K shards and using something like `cat-bgen` from `bgenix` to get down to the 500-partition size of interest
- Using `export_gen` and converting with `bgenix`
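If we go with the second option, the 18K shard files would need to be grouped into 500 batches before handing each batch to a concatenation tool. A minimal sketch of the grouping step (the shard naming scheme and the `cat-bgen` invocation in the comment are my assumptions, not tested):

```python
# Sketch: split 18K parallel-export shards into 500 contiguous batches,
# one batch per desired output BGEN. File names here are hypothetical.
def batch_shards(shards, n_batches):
    """Split `shards` into `n_batches` contiguous, near-equal groups."""
    base, extra = divmod(len(shards), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        size = base + (1 if i < extra else 0)
        batches.append(shards[start:start + size])
        start += size
    return batches

shards = [f'part-{i:05d}.bgen' for i in range(18_000)]  # hypothetical names
batches = batch_shards(shards, 500)
# Each batch could then be concatenated with something like:
#   cat-bgen -g <batch files...> -og part_<i>.bgen
print(len(batches), len(batches[0]))  # 500 36
```

Keeping the batches contiguous should preserve the genomic ordering of the original export.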
Thanks for your help!