Time taken to read a large VCF file

Hello Everyone,

I was trying to read a large VCF file. Its size, as reported by the ls -lh command on Unix, is shown below:

-rw-rw-r-- 1 Test Test 42G Sep 10 13:31 merged_file.vcf.gz

May I check with you how long Hail typically takes to read/import a file of this size?

hl.import_vcf('merged_file.vcf.gz', force_bgz=True).write('/home/test/merged.mt', overwrite=True)

If I don’t provide the force_bgz argument, it throws an error, so I have included it.

It’s been running for more than an hour.

Could there be another reason for such a delay, or is this expected for my file size?
Also, can Hail read multiple files one by one and append them to a MatrixTable?

Please let me know if you would like me to run any other Unix command to report the hardware details of my server.

The lscpu command gave the below output for my system:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             1
NUMA node(s):          1
CPU family:            6
Model:                 79
Stepping:              1
CPU MHz:               2499.921
CPU max MHz:           2900.0000
CPU min MHz:           1200.0000
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-23

The lsblk command gave the below output:

NAME        MAJ:MIN RM  SIZE RO TYPE      MOUNTPOINT
Testa           8:0    0  3.7T  0 disk
└─Testa1        8:1    0  3.7T  0 part    /data/T1
Testb           8:16   0  3.7T  0 disk
└─Testb1        8:17   0  3.7T  0 part    /data/T2
Testc           8:32   0  3.7T  0 disk
└─Testc1        8:33   0  3.7T  0 part    /data/T3
nvme0n1     259:0    0  477G  0 disk
├─nvme0n1p1 259:1    0  476G  0 part /
├─nvme0n1p2 259:2    0    1K  0 part
└─nvme0n1p5 259:3    0  975M  0 part

Execution time will depend on a bunch of factors, and even the ones you’ve listed aren’t enough to give a good estimate of runtime. One of the things you can look at to get a sense for progress is the Spark progress bar – are you seeing this? It looks like the below:

[Stage 0:======================================================> (97 + 3) / 100]

The way to read the numbers at the end is (C + R) / T, where C is tasks completed, R is tasks currently running, and T is total tasks.

I think you should expect T to be somewhere around 500-2000, R to be 48 (if it’s not, Hail is not using all the resources on your machine), and C to be a notion of progress.


Hi @tpoterba,

Thanks for the response; this is useful to know.

q1) Are the numbers you gave (500-2000 for T and 48 for R) based on the task I mentioned above, or were they inferred from the hardware configuration I pasted above?

q2) I would like to learn how you arrived at those numbers, and which information from the progress bar output can be used to tell whether Hail is using all of my system resources.

q3) Should I explicitly assign a chunk of my system storage to run Hail? All I want is to make sure that Hail is using all my system resources so that it runs fast. Can you guide me on how to update the memory settings?

As I am new to this infrastructure, any help with the above questions would be appreciated.

q1) the 500-2000 number is a rough number of partitions that should perform well for a 42G file. There’s probably a larger range than that which will work fine, but this is a good place to start. Hail should be automatically choosing a number of partitions based on the file size, using either 32M or 128M splits, which would give you something in this neighborhood.
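
As a minimal sketch (assuming the file path from the pipeline above; min_partitions=1000 is only an illustrative value, not a recommendation), you could check, and if needed nudge, the partition count like this:

import hail as hl

hl.init()  # local mode; Spark uses the cores on this machine

# force_bgz=True treats the .gz file as block-gzipped (BGZF),
# which is what lets the import be split into parallel tasks.
mt = hl.import_vcf('merged_file.vcf.gz',
                   force_bgz=True,
                   min_partitions=1000)  # illustrative lower bound only

# The partition count is the T in the (C + R) / T progress readout.
print(mt.n_partitions())

mt.write('/home/test/merged.mt', overwrite=True)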

q2) You should see a progress bar when you run the import_vcf/write pipeline above. If not, how are you running Python?

q3) Hail should interact with your storage system like any other tool, no input needed there.

Hi @tpoterba,

I appreciate your help and value your time. A couple of quick follow-up questions:

q2) Yes, I usually see a progress bar, but in the terminal (the black screen) rather than in the notebook itself. When I launch Jupyter with the jupyter notebook command, there is a terminal window that shows the logs and a browser window with the notebook; the progress bar appears in that terminal window.

q3) So I don’t have to manually assign any specific storage or temp directory, the way the Hail installation doc described earlier (I guess that doc is now deprecated)? I remember seeing something like the below in the old doc:

import pyspark

# hail_jars here was the path to the Hail jar, defined earlier in that old doc
conf = pyspark.SparkConf().setAll([
    ('spark.jars', str(hail_jars)),
    ('spark.driver.extraClassPath', str(hail_jars)),
    ('spark.executor.extraClassPath', './hail-all-spark.jar'),
    ('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
    ('spark.kryo.registrator', 'is.hail.kryo.HailKryoRegistrator'),
    ('spark.driver.memory', '180g'),
    ('spark.executor.memory', '180g'),
    ('spark.local.dir', '/t1,/data/abcd/spark')
])

But now I don’t see these instructions in the Hail installation docs. Does that mean Hail can make use of all the available memory and allocate it optimally on its own?

Hail sets the necessary configuration (such as the spark.kryo.registrator above) automatically. If you’re running on a single server, the best way to configure memory is with the PYSPARK_SUBMIT_ARGS environment variable (just search for that on the forum here).
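
For example (just a sketch; the 100g driver-memory value is illustrative and should be sized to your machine), the variable can be set from Python before Hail is initialized:

import os

# Must be set before hail/pyspark starts the JVM; the trailing
# 'pyspark-shell' token is required by PySpark.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 100g pyspark-shell'

import hail as hl
hl.init()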

What does the progress bar look like for you when you’re running the import/write?