Time taken to read a large VCF file

Hello Everyone,

I am trying to read a large VCF file. Its size, as reported by the ls -lh command, is shown below:

-rw-rw-r-- 1 Test Test 42G Sep 10 13:31 merged_file.vcf.gz

How long should I expect Hail to take to read/import the above file with the command below?

hl.import_vcf('merged_file.vcf.gz',force_bgz=True).write('/home/test/merged.mt', overwrite=True)

If I don’t provide the force_bgz argument, it throws an error, so I have included it.

It has been running for more than an hour.

Could there be another reason for such a delay, or is this expected for my file size?
Also, can Hail read multiple files one by one and append them to a MatrixTable?
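Since force_bgz tells Hail to treat a .gz file as block-gzipped (BGZF), it can help to confirm that the file really was written by bgzip and not plain gzip. The following is a minimal sketch, assuming the standard BGZF layout (a gzip header with an extra 'BC' subfield); the helper names is_bgzf_header and is_bgzf_file are made up for illustration and are not part of Hail.

```python
import struct


def is_bgzf_header(header: bytes) -> bool:
    """Return True if `header` starts with a BGZF (block gzip) block header."""
    # A BGZF block is a gzip member: magic 1f 8b, deflate (08), FEXTRA flag set.
    if len(header) < 12 or header[:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
        return False
    # XLEN (little-endian uint16 at offset 10) is the length of the extra field.
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]
    # Walk the extra subfields looking for the 'BC' subfield with a 2-byte payload.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False


def is_bgzf_file(path: str) -> bool:
    """Check only the first block of the file at `path`."""
    with open(path, "rb") as handle:
        return is_bgzf_header(handle.read(1024))


# The 28-byte BGZF end-of-file marker is itself a valid (empty) BGZF block.
BGZF_EOF = bytes.fromhex("1f8b08040000000000ff0600424302001b0003000000000000000000")
# A plain gzip header has no extra field (flag byte is 0).
PLAIN_GZIP_HEADER = b"\x1f\x8b\x08\x00" + bytes(8)

print(is_bgzf_header(BGZF_EOF))
print(is_bgzf_header(PLAIN_GZIP_HEADER))
```

Calling is_bgzf_file('merged_file.vcf.gz') would inspect only the first block, so treat it as a quick sanity check and not as validation of the whole file.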

Please let me know if I should run any other Unix command to report the hardware details of my server.

The lscpu command gave the output below:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             1
NUMA node(s):          1
CPU family:            6
Model:                 79
Stepping:              1
CPU MHz:               2499.921
CPU max MHz:           2900.0000
CPU min MHz:           1200.0000
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-23

The lsblk command gave the output below:

Testa           8:0    0  3.7T  0 disk
└─Testa1        8:1    0  3.7T  0 part    /data/T1
Testb           8:16   0  3.7T  0 disk
└─Testb1        8:17   0  3.7T  0 part    /data/T2
Testc           8:32   0  3.7T  0 disk
└─Testc1        8:33   0  3.7T  0 part    /data/T3
nvme0n1     259:0    0  477G  0 disk
├─nvme0n1p1 259:1    0  476G  0 part /
├─nvme0n1p2 259:2    0    1K  0 part
└─nvme0n1p5 259:3    0  975M  0 part

Execution time will depend on a bunch of factors, and even the ones you’ve listed aren’t enough to give a good estimate of runtime. One of the things you can look at to get a sense for progress is the Spark progress bar – are you seeing this? It looks like the below:

[Stage 0:======================================================> (97 + 3) / 100]

Read the numbers at the end as (C + R) / T, where C is tasks completed, R is tasks currently running, and T is total tasks.

I think you should expect T to be somewhere around 500-2000, R to be 48 (if it’s not, Hail is not using all the resources on your machine), and C to be a notion of progress.
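As a rough illustration of reading (C + R) / T, here is a small sketch that pulls the three numbers out of a progress line copied from the terminal. The regular expression is an assumption based on the example line above, and parse_progress and StageProgress are hypothetical names, not part of Spark or Hail.

```python
import re
from typing import NamedTuple, Optional


class StageProgress(NamedTuple):
    stage: int
    completed: int  # C: tasks completed
    running: int    # R: tasks currently running
    total: int      # T: total tasks


# Assumed shape of the line: "[Stage N:====> (C + R) / T]"
_PROGRESS = re.compile(r"\[Stage (\d+):.*?\((\d+)\s*\+\s*(\d+)\)\s*/\s*(\d+)\]")


def parse_progress(line: str) -> Optional[StageProgress]:
    """Return the stage number and task counts, or None if the line does not match."""
    match = _PROGRESS.search(line)
    if match is None:
        return None
    return StageProgress(*(int(group) for group in match.groups()))


line = "[Stage 0:======================================================> (97 + 3) / 100]"
progress = parse_progress(line)
print(progress)
print(f"{progress.completed / progress.total:.0%} of tasks complete")
```

Comparing the running count against the number of CPUs on the machine is one way to see whether all cores are busy.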


Hi @tpoterba,

Thanks for the response. This is useful to know.

q1) Are the numbers you gave (500-2000 for T and 48 for R) based on the task I described above, or were they inferred from the hardware configuration I pasted?

q2) How did you arrive at those numbers, and which information from the screenshot shows whether Hail is using all of my system resources?

q3) Should I explicitly assign a chunk of my system storage to Hail? All I want is to make sure that Hail uses all my system resources so that it runs fast. Can you guide me on how to update the memory settings?

As I am new to infrastructure, it would be helpful if you could answer the above questions.

q1) the 500-2000 number is a rough number of partitions that should perform well for a 42G file. There’s probably a larger range than that which will work fine, but this is a good place to start. Hail should be automatically choosing a number of partitions based on the file size, using either 32M or 128M splits, which would give you something in this neighborhood.
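To make the arithmetic concrete, here is a small sketch that estimates the partition count from the file size and the two split sizes mentioned above. It treats the "42G" reported by ls -lh as exactly 42 GiB, which is only an approximation, and estimate_partitions is an illustrative helper, not a Hail function.

```python
import math

MiB = 1024 ** 2
GiB = 1024 ** 3


def estimate_partitions(file_size_bytes: int, split_size_bytes: int) -> int:
    """Number of splits needed to cover the file, rounding up."""
    return math.ceil(file_size_bytes / split_size_bytes)


file_size = 42 * GiB  # assumption: ls -lh rounds, so the true size differs slightly

for split_mib in (32, 128):
    count = estimate_partitions(file_size, split_mib * MiB)
    print(f"{split_mib}M splits -> about {count} partitions")
```

With these assumptions, 32M splits give 1344 partitions and 128M splits give 336, so the T shown in the progress bar should land somewhere in that range.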

q2) You should see a progress bar when you run the import_vcf/write pipeline above. If not, how are you running Python?

q3) Hail should interact with your storage system like any other tool, no input needed there.

Hi @tpoterba,

I appreciate your help and value your time. A few quick follow-up questions:

q2) Yes, I usually see a progress bar in the terminal (the black screen) from which I launch Jupyter. When we launch Jupyter with the command jupyter notebook, there is a terminal window that shows the logs and a browser window that holds the notebook. The terminal is where I see the progress bar.

q3) So I don’t have to manually assign any specific or temporary storage directory, as the Hail installation docs described earlier (I guess those docs are now deprecated)? I remember seeing something like the below in the old docs:

conf = pyspark.SparkConf().setAll([
    ('spark.jars', str(hail_jars)),
    ('spark.driver.extraClassPath', str(hail_jars)),
    ('spark.executor.extraClassPath', './hail-all-spark.jar'),
    ('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
    ('spark.kryo.registrator', 'is.hail.kryo.HailKryoRegistrator'),
    ('spark.driver.memory', '180g'),
    ('spark.executor.memory', '180g'),
    ('spark.local.dir', '/t1,/data/abcd/spark'),
])

But I no longer see these instructions in the Hail installation docs. Does that mean Hail can now make use of all the available memory and allocate it optimally?

Hail now sets the necessary configuration (the settings above, up through spark.kryo.registrator) automatically. If you’re running on a single server, the best way to configure memory is with the PYSPARK_SUBMIT_ARGS environment variable (just search for that on the forum here).
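For reference, a minimal sketch of setting PYSPARK_SUBMIT_ARGS from Python follows. The 8g value is a placeholder assumption, not a recommendation for this server, and set_spark_memory is a hypothetical helper. The variable has to be set before Hail (and therefore Spark) starts, so in a notebook it belongs in the first cell.

```python
import os


def set_spark_memory(driver_memory: str) -> str:
    """Set PYSPARK_SUBMIT_ARGS so a locally launched Spark JVM gets `driver_memory`."""
    # In local mode the tasks run inside the driver JVM, so driver memory is what matters.
    # The trailing "pyspark-shell" token is required by PySpark's launcher.
    args = f"--driver-memory {driver_memory} pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = args
    return args


# Placeholder value: pick something that fits within the RAM on your machine.
print(set_spark_memory("8g"))

# Only after the variable is set:
# import hail as hl
# hl.import_vcf('merged_file.vcf.gz', force_bgz=True).write('/home/test/merged.mt', overwrite=True)
```

The same effect can be had by exporting the variable in the shell before running jupyter notebook.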

What does the progress bar look like for you when you’re running the import/write?