Working with large VCFs (e.g. from UK Biobank) is slow

Hi folks—

I am new to Hail and the UK Biobank, and I am trying to accomplish what I believe to be a pretty simple task: load a pVCF for one of the chromosome blocks and explore the data (and eventually extract variants/genotypes/patient identifiers (EIDs) within a certain region, although I haven’t gotten there yet). However, when I try to explore the data, the commands run slowly (and I bet I am also not doing it in an efficient way).

The pVCF I’m interested in is ~23 GB. I’m using the following code:

import hail as hl
hl.init()  
from hail.plot import show
from pprint import pprint
hl.plot.output_notebook()
mt = hl.import_vcf('file:////mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c3_b3_v1.vcf.gz', force_bgz=True, reference_genome="GRCh38")

Some commands run quickly, such as

mt.describe()

However, when I try to look at the loci of the first few rows with the following code, it takes a long time (I’d be very much interested in better alternatives):

row_table = mt.rows().select()
row_table.head(4).locus.show()

I timed the second command and apparently it took about 2.5 minutes.

I am already using what I thought would be a pretty powerful Jupyter notebook instance on DNAnexus, mem2_ssd2_x16. I don’t know whether it’s relevant, but here is information from lscpu and lsblk:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:            1
CPU MHz:             2300.176
BogoMIPS:            4600.15
Hypervisor vendor:   Xen
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            46080K
NUMA node0 CPU(s):   0-15
NAME     MAJ:MIN  RM  SIZE RO TYPE  MOUNTPOINT
loop0      7:0     0   47M  1 loop  
loop1      7:1     0   47M  1 loop  
loop2      7:2     0 25.1M  1 loop  
loop3      7:3     0 55.5M  1 loop  
loop4      7:4     0 55.6M  1 loop  
xvda     202:0     0   13G  0 disk  
├─xvda1  202:1     0 12.9G  0 part  
├─xvda14 202:14    0    4M  0 part  

Any help would be appreciated!

Best,
Jeremy

Hi Jeremy,
Thanks for reaching out about this. I have two answers.

First, Hail is optimized to run against its native formats (Hail Tables and Matrix Tables) rather than VCFs. If you import the VCF once, write it out as a Hail Matrix Table, and then work against that, everything should be much faster.
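
Roughly, that workflow looks like the sketch below (it reuses the path and options from your post; the output name is just an example):

import hail as hl

# One-time conversion: import the pVCF and write it out in Hail's native format.
mt = hl.import_vcf(
    'file:////mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c3_b3_v1.vcf.gz',
    force_bgz=True,
    reference_genome='GRCh38')
mt.write('ukb23157_c3_b3_v1.mt', overwrite=True)

# Every later query reads the Matrix Table instead of re-parsing the VCF.
mt = hl.read_matrix_table('ukb23157_c3_b3_v1.mt')
mt.rows().select().head(4).show()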

Second, you’ve revealed a performance bug in Hail: some work we need to do to scan a VCF for ordering/chromosome boundaries is being executed for every query rather than once per Hail session. Once we fix this, I think you should see that the first query takes about 4 minutes and the rest are fast.


Ah, Tim is too fast for me! I have just a tiny bit to add.

If you’re just slicing and subsetting a VCF, I think you’ll have a better time with tools like vcftools. If you’ll eventually want to perform expensive numerical analysis like relatedness, regressions, and PCA, I think you’ll find the cost of importing to an analysis-friendly format (e.g. Hail Matrix Table format) well worth the trouble.

If you go that route, I recommend using a cluster of cheap machines without SSDs, storing your data in blob/object storage (like S3), and operating on the dataset as one logical dataset rather than manually chunking along the chromosome. You can import many chromosomally-chunked VCFs together with a glob pattern, as sketched below.
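
For example, a minimal sketch (the glob and output name are illustrative; adjust them to the block naming on your project):

import hail as hl

# Import every block of chromosome 3 in one call via a glob pattern,
# then write everything out as a single Matrix Table.
mt = hl.import_vcf(
    'file:////mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c3_b*_v1.vcf.gz',
    force_bgz=True,
    reference_genome='GRCh38')
mt.write('ukb23157_c3.mt', overwrite=True)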

This change will speed up all queries after the first:

Thank you Tim and Dan for the super quick and helpful responses! A few follow-ups:

  1. Tim, I didn’t realize the distinction between reading in a VCF and reading in a Matrix Table! Now, I see that this idea was mentioned in this line in the Hail GWAS tutorial. Also glad that I somehow stumbled upon a bug—hope it helps others.

  2. Both, for some reason now when I’m trying to read the VCF and write the Matrix Table with the command

hl.import_vcf('file:////mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c3_b33_v1.vcf.gz', force_bgz=True, reference_genome="GRCh38").write('ukb23157_c3_b33_v1.mt', overwrite=True)

I get the following error:

2022-08-19 05:37:57 Hail: INFO: Coerced sorted dataset

---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
<timed eval> in <module>

<decorator-gen-1275> in write(self, output, overwrite, stage_locally, _codec_spec, _partitions, _checkpoint_file)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/opt/conda/lib/python3.6/site-packages/hail/matrixtable.py in write(self, output, overwrite, stage_locally, _codec_spec, _partitions, _checkpoint_file)
   2542 
   2543         writer = ir.MatrixNativeWriter(output, overwrite, stage_locally, _codec_spec, _partitions, _partitions_type, _checkpoint_file)
-> 2544         Env.backend().execute(ir.MatrixWrite(self._mir, writer))
   2545 
   2546     class _Show:

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
    108                 raise HailUserError(message_and_trace) from None
    109 
--> 110             raise e

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
     84         # print(self._hail_package.expr.ir.Pretty.apply(jir, True, -1))
     85         try:
---> 86             result_tuple = self._jhc.backend().executeEncode(jir, stream_codec)
     87             (result, timings) = (result_tuple._1(), result_tuple._2())
     88             value = ir.typ._from_encoding(result)

/cluster/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     29                 raise FatalError('%s\n\nJava stack trace:\n%s\n'
     30                                  'Hail version: %s\n'
---> 31                                  'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
     32         except pyspark.sql.utils.CapturedException as e:
     33             raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: VCFParseError: missing value in FORMAT array. Import with argument 'array_elements_required=False'

I’m using the same instance type as before, so I’m not really sure what could be going on. Sometimes when I run the code, I get a different number of Hail: INFO: Coerced sorted dataset messages too. Any ideas what’s going on? I’m about to try with a more powerful instance just to see if there’s some memory/storage issue.

  3. Dan, thank you also for the vcftools suggestion—yes, I think I should look into it. I’m new to all this and am not sure what best practices are for what I’d like to accomplish.

  4. Dan, to your point that for larger datasets it would be better to pay the overhead of importing into a Matrix Table—when would you typically say the cost is worth it? I thought that even this UK Biobank pVCF chromosome block (~20 GB) would reach that level, but maybe you don’t think so—perhaps because it is still small enough to work with using vcftools on a single machine?

  5. I am also just realizing now that I may not actually be using Hail on multiple nodes of the cluster via Spark. I just noticed that I initialized Hail using the code in the GWAS tutorial, which is:

import hail as hl
hl.init()  
from hail.plot import show
from pprint import pprint
hl.plot.output_notebook()

When in fact it looks like, according to one of Dan’s previous posts, I should be using (at least the first time)

import pyspark
import dxpy

sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
spark.sql("CREATE DATABASE my_database LOCATION  'dnax://'")

… is that right?

Best,
Jeremy

Most VCFs don’t have missing elements inside FORMAT arrays (like AD), and Hail can generate faster queries and smaller datasets if we make that assumption (since we don’t need to track or handle missingness there). In this case, just pass array_elements_required=False as suggested and the VCF should import just fine.
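
Concretely, adapting the command from your post, it would look something like this:

mt = hl.import_vcf(
    'file:////mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c3_b33_v1.vcf.gz',
    force_bgz=True,
    reference_genome='GRCh38',
    array_elements_required=False)  # allow missing elements inside FORMAT arrays such as AD
mt.write('ukb23157_c3_b33_v1.mt', overwrite=True)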

If you’re planning on doing more than a couple queries, it’s probably worth it.

Yep, I think this is right.
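
One addition: once the Spark session exists, you can hand the context to Hail so that queries run on the cluster rather than only on the driver. A minimal sketch of what that might look like:

import pyspark
import hail as hl

# Reuse the cluster's Spark context rather than letting Hail start a local one.
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

# Initializing Hail against this context lets work be distributed across the worker nodes.
hl.init(sc=sc)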

Got it—thank you Tim (and Dan)! I’ve been able to get Hail up and running. Looking forward to exploring the data!


Thanks for the quick fix, Tim! This will be so nice when it becomes available on the UKB RAP in 10 months!


I’m facing a similar problem. Here’s my code:

ds = hl.import_vcf('file:///mnt/project/ukb23157_PKD1.vcf.gz', reference_genome='GRCh38', force=True)
print(ds)
ds.write('ukb23157_PKD1.mt')

The first line runs immediately, but the last one (ds.write) takes forever and practically never finishes. The VCF file is only ~10GB in size.

Not sure if this will fix your problem, but I’ve had unpredictable issues with the UKB RAP that can be solved either by (1) restarting the instance or (2) choosing a larger instance. I’ve often used a mem3_ssd1_v2_x4 instance with 5–10 nodes.

I don’t recall why, but I also pass array_elements_required=False to import_vcf().

If none of that works, I’m sure the Hail team can help with a more sophisticated answer.

Best,
Jeremy

With force=True (standard gzip compression), there is no parallelism: everything runs on a single CPU. If this file is actually block-gzipped but has a .gz name, use force_bgz=True instead.
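
Assuming the file really is block-gzipped, the corrected import would look something like this (the output name is just an example):

ds = hl.import_vcf(
    'file:///mnt/project/ukb23157_PKD1.vcf.gz',
    reference_genome='GRCh38',
    force_bgz=True)  # read the block-gzipped file in parallel instead of on one CPU
ds.write('ukb23157_PKD1.mt', overwrite=True)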

Hi @tpoterba, would you break this down for me? I’m curious how this works. When you run import_vcf and then run some type of operation on it, does that operation happen directly on the VCF file? What’s the difference between running ETL on this versus a pre-written matrix table?

I do see this information is explicitly written in the docs, but I couldn’t find anything more detailed about why VCF import is less performant. I actually never noticed this in the docs before, which is probably my fault… But I also noticed many of the examples in the Hail documentation don’t actually import_vcf -> write_MT -> read_MT -> join/transform_MT, which is a bit confusing. I also don’t see any warning or logs about this when I use Hail. Admittedly, we’re using a much too old version of Hail, so that logging may be there in newer versions of the package.