Working with large VCFs (e.g. from UK Biobank) is slow

Thank you, Tim and Dan, for the super quick and helpful responses! A few follow-ups:

  1. Tim, I didn’t realize the distinction between reading in a VCF and reading in a Matrix Table! Now I see that this idea is mentioned in this line of the Hail GWAS tutorial. Also glad that I somehow stumbled upon a bug; hope the report helps others. (I’ve put a quick recap sketch just below for anyone following along.)
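My recap of that distinction, with hypothetical file names, in case I have it wrong: import_vcf re-parses the text VCF every time, whereas read_matrix_table loads Hail’s native format, so you pay the parse cost once:

import hail as hl

# parse the text VCF (slow for large files); pay this cost once
mt = hl.import_vcf('example.vcf.gz', force_bgz=True, reference_genome='GRCh38')
mt.write('example.mt', overwrite=True)

# later sessions load the native Matrix Table directly (fast)
mt = hl.read_matrix_table('example.mt')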

  2. Both: for some reason, when I now try to read the VCF and write the Matrix Table with the command

hl.import_vcf('file:////mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c3_b33_v1.vcf.gz',
              force_bgz=True,
              reference_genome="GRCh38").write('ukb23157_c3_b33_v1.mt', overwrite=True)

I get the following error:

2022-08-19 05:37:57 Hail: INFO: Coerced sorted dataset

---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
<timed eval> in <module>

<decorator-gen-1275> in write(self, output, overwrite, stage_locally, _codec_spec, _partitions, _checkpoint_file)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/opt/conda/lib/python3.6/site-packages/hail/matrixtable.py in write(self, output, overwrite, stage_locally, _codec_spec, _partitions, _checkpoint_file)
   2542 
   2543         writer = ir.MatrixNativeWriter(output, overwrite, stage_locally, _codec_spec, _partitions, _partitions_type, _checkpoint_file)
-> 2544         Env.backend().execute(ir.MatrixWrite(self._mir, writer))
   2545 
   2546     class _Show:

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
    108                 raise HailUserError(message_and_trace) from None
    109 
--> 110             raise e

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
     84         # print(self._hail_package.expr.ir.Pretty.apply(jir, True, -1))
     85         try:
---> 86             result_tuple = self._jhc.backend().executeEncode(jir, stream_codec)
     87             (result, timings) = (result_tuple._1(), result_tuple._2())
     88             value = ir.typ._from_encoding(result)

/cluster/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     29                 raise FatalError('%s\n\nJava stack trace:\n%s\n'
     30                                  'Hail version: %s\n'
---> 31                                  'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
     32         except pyspark.sql.utils.CapturedException as e:
     33             raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: VCFParseError: missing value in FORMAT array. Import with argument 'array_elements_required=False'

I’m using the same instance type as before, so I’m not sure what could be going on. Sometimes when I run the code I also get a different number of “Hail: INFO: Coerced sorted dataset” messages. Any ideas? I’m about to try a more powerful instance just to rule out a memory/storage issue. In the meantime, I’ve sketched below what I think the error message is telling me to do.
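If I’m reading the error summary right, the fix may just be to re-run the import with the flag it names, i.e. the same command as above plus array_elements_required=False:

hl.import_vcf('file:////mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c3_b33_v1.vcf.gz',
              force_bgz=True,
              reference_genome="GRCh38",
              array_elements_required=False  # tolerate missing values in FORMAT arrays, per the error message
              ).write('ukb23157_c3_b33_v1.mt', overwrite=True)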

  3. Dan, thank you also for the vcftools suggestion; yes, I think I should look into it. I’m new to all of this and not sure what the best practices are for what I’d like to accomplish.

  4. Dan, to your point that for larger datasets it would be better to pay the overhead of importing into a Matrix Table: when would you typically say that cost is worth it? I thought even this single UK Biobank pVCF chromosome block (~20 GB) would reach that level, but maybe you don’t think so, perhaps because it’s still small enough to work with using vcftools on a single machine?

  5. I am also just realizing that I may not actually be using Hail on multiple nodes of the cluster via Spark. I initialized Hail using the code in the GWAS tutorial, which is:

import hail as hl
hl.init()  # note: no existing SparkContext is passed in here
from hail.plot import show
from pprint import pprint
hl.plot.output_notebook()

When in fact, according to one of Dan’s previous posts, it looks like I should have been using (at least the first time):

import pyspark
import dxpy

# create the Spark context/session by hand, then create a database backed by
# DNAnexus project storage (the 'dnax://' scheme)
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
spark.sql("CREATE DATABASE my_database LOCATION 'dnax://'")

… is that right?
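If so, my guess is that the two snippets get combined roughly like this, handing the existing SparkContext to Hail via hl.init’s sc parameter (just my guess at the wiring; corrections welcome):

import pyspark
import dxpy
import hail as hl

# attach to the cluster's Spark instead of letting hl.init() start its own
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

# hand the existing context to Hail so work runs on the cluster's executors
hl.init(sc=sc)

# sanity check: a master like 'local[*]' would suggest I'm not on the cluster
print(sc.master)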

Best,
Jeremy