Memory and disk space requirements

Hello,

I apologise in advance if this has already been answered somewhere.

I have a giant merged and indexed VCF file with ca. 1,400 WGS samples and ~66.5 million variants. The vcf.gz file is 114 GB. I am working with limited resources (64 GB of RAM and about 600 GB of disk space left).
I would like to convert this VCF into a Hail MatrixTable.
Before I start the conversion, I would like to know how much disk space and memory I should expect it to consume. I have other processes running, and I would not want the conversion to cause issues by exhausting disk space or memory.

Thank you.

I think the MatrixTable on disk will occupy roughly the same size as the VCF. 64 GB of RAM should be plenty – what is the configuration of the server where you plan to run Hail (how many cores, etc.)?

Hi,
Thanks for the quick reply. It's an AWS EC2 instance with 8 threads and 64 GB of RAM.

I think you should be fine. Do follow the advice here to configure max memory correctly (you'll want to use something like 48G instead of 16G to account for your larger machine).
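
For concreteness, here is a minimal sketch of one way to set the driver memory (assuming a local, non-cluster Spark backend; the 48g figure is just the value suggested above, and PYSPARK_SUBMIT_ARGS has to be set before hail is imported):

import os

# Assumption: local Spark backend on a 64 GB machine; 48g leaves headroom for other processes.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 48g pyspark-shell'

import hail as hl  # must be imported after the environment variable is set
hl.init()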

Oh perfect. Thank you.

Hi again,

Unfortunately, the conversion step crashed with a "No space left on device" error after a 7-hour run. Here are the code and the error.

#!/usr/bin/env python

import sys
import os.path

import hail as hl
import sklearn
import pickle

# Input VCF; the MatrixTable output is named after the VCF basename
VCF = "/data4/temp_mergeVCFs_AMT/merged.vcf.gz"
vcfbase = os.path.basename(VCF)

mtout = "../04-results/03-hailTable/" + vcfbase + ".mt"

# Import the bgzipped VCF against GRCh38 and write it out as a MatrixTable
hl.import_vcf(VCF, reference_genome='GRCh38', force_bgz=True, array_elements_required=False).write(mtout, overwrite=True)
Initializing Hail with default parameters...
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/bioinfoRD/ARCdata/Projects_AMT/conda_envs/hail/lib/python3.10/site-packages/pyspark/jars/spark-unsafe_2.12-3.1.3.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2022-10-06 07:09:57 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.1.3
SparkUI available at http://ip-172-18-1-233.eu-west-1.compute.internal:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.100-2ea2615a797a
LOGGING: writing to /bioinfoRD/ARCdata/Projects_AMT/2022-09-29_arcWGSgnomadPredict/03-runme/hail-20221006-0709-0.2.100-2ea2615a797a.log
2022-10-06 07:10:07 Hail: INFO: scanning VCF for sortedness...
2022-10-06 08:04:21 Hail: INFO: VCF is out of order...
  Write the dataset to disk before running multiple queries to avoid multiple costly data shuffles.
2022-10-06 12:25:36 Hail: INFO: Ordering unsorted dataset with network shuffle
Traceback (most recent call last):
  File "/bioinfoRD/ARCdata/Projects_AMT/2022-09-29_arcWGSgnomadPredict/03-runme/./01-vcf2hail.py", line 14, in <module>
    hl.import_vcf(VCF, reference_genome='GRCh38', force_bgz=True, array_elements_required=False).write(mtout, overwrite=True)
  File "<decorator-gen-1172>", line 2, in write
  File "/bioinfoRD/ARCdata/Projects_AMT/conda_envs/hail/lib/python3.10/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/bioinfoRD/ARCdata/Projects_AMT/conda_envs/hail/lib/python3.10/site-packages/hail/matrixtable.py", line 2558, in write
    Env.backend().execute(ir.MatrixWrite(self._mir, writer))
  File "/bioinfoRD/ARCdata/Projects_AMT/conda_envs/hail/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 104, in execute
    self._handle_fatal_error_from_backend(e, ir)
  File "/bioinfoRD/ARCdata/Projects_AMT/conda_envs/hail/lib/python3.10/site-packages/hail/backend/backend.py", line 181, in _handle_fatal_error_from_backend
    raise err
  File "/bioinfoRD/ARCdata/Projects_AMT/conda_envs/hail/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 98, in execute
    result_tuple = self._jbackend.executeEncode(jir, stream_codec, timed)
  File "/bioinfoRD/ARCdata/Projects_AMT/conda_envs/hail/lib/python3.10/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/bioinfoRD/ARCdata/Projects_AMT/conda_envs/hail/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 31, in deco
    raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
hail.utils.java.FatalError: IOException: No space left on device

I have plenty of space. Does the program use the /tmp/ directory by any chance?

Yes, you can modify this with hl.init(tmp_dir=...).
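
Something like this (the path is a placeholder for a directory on your large volume; recent Hail versions also accept local_tmpdir for worker-local scratch files):

import hail as hl

# Placeholder path – point this at a directory with plenty of free space.
hl.init(tmp_dir='/data4/hail_tmp', local_tmpdir='/data4/hail_tmp')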

Hi,

Thanks for the reply. Unfortunately, it did not help. I get the same error.

Can you use df to determine which file system is running out of space?

It’s possible that you’re somehow picking up HDFS file paths. What’s the full stack trace? You might try prefixing all your paths with file://. In general, I recommend using an EMR cluster; it will work out of the box and has better scale-up and scale-down properties.
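
Something along these lines (a sketch only – the output path here is a placeholder, and file:// URLs need absolute paths):

import hail as hl

VCF = "file:///data4/temp_mergeVCFs_AMT/merged.vcf.gz"
mtout = "file:///path/to/04-results/03-hailTable/merged.vcf.gz.mt"  # placeholder absolute path

hl.import_vcf(VCF, reference_genome='GRCh38', force_bgz=True,
              array_elements_required=False).write(mtout, overwrite=True)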