Thank you Tim and Dan for the super quick and helpful responses! A few follow-ups:
-
Tim, I didn’t realize the distinction between reading in a VCF and reading in a Matrix Table! Now I see that this idea was mentioned in this line of the Hail GWAS tutorial. I’m also glad I somehow stumbled upon a bug; hopefully the report helps others.
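For my own notes, here is the distinction as I now understand it, sketched in code (the file paths here are just placeholders, not my actual files):
import hail as hl
# import_vcf parses the VCF text itself, which is the slow step done once
mt = hl.import_vcf('file:///path/to/data.vcf.gz', force_bgz=True, reference_genome="GRCh38")
mt.write('data.mt', overwrite=True)
# read_matrix_table loads Hail's native Matrix Table format, so later sessions skip the parsing
mt = hl.read_matrix_table('data.mt')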
-
To both of you: for some reason, now when I try to read the VCF and write the Matrix Table with the command
hl.import_vcf('file:////mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c3_b33_v1.vcf.gz', force_bgz=True, reference_genome="GRCh38").write('ukb23157_c3_b33_v1.mt', overwrite=True)
I get the following error:
2022-08-19 05:37:57 Hail: INFO: Coerced sorted dataset
---------------------------------------------------------------------------
FatalError Traceback (most recent call last)
<timed eval> in <module>
<decorator-gen-1275> in write(self, output, overwrite, stage_locally, _codec_spec, _partitions, _checkpoint_file)
/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
575 def wrapper(__original_func, *args, **kwargs):
576 args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577 return __original_func(*args_, **kwargs_)
578
579 return wrapper
/opt/conda/lib/python3.6/site-packages/hail/matrixtable.py in write(self, output, overwrite, stage_locally, _codec_spec, _partitions, _checkpoint_file)
2542
2543 writer = ir.MatrixNativeWriter(output, overwrite, stage_locally, _codec_spec, _partitions, _partitions_type, _checkpoint_file)
-> 2544 Env.backend().execute(ir.MatrixWrite(self._mir, writer))
2545
2546 class _Show:
/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
108 raise HailUserError(message_and_trace) from None
109
--> 110 raise e
/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
84 # print(self._hail_package.expr.ir.Pretty.apply(jir, True, -1))
85 try:
---> 86 result_tuple = self._jhc.backend().executeEncode(jir, stream_codec)
87 (result, timings) = (result_tuple._1(), result_tuple._2())
88 value = ir.typ._from_encoding(result)
/cluster/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
29 raise FatalError('%s\n\nJava stack trace:\n%s\n'
30 'Hail version: %s\n'
---> 31 'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
32 except pyspark.sql.utils.CapturedException as e:
33 raise FatalError('%s\n\nJava stack trace:\n%s\n'
FatalError: VCFParseError: missing value in FORMAT array. Import with argument 'array_elements_required=False'
I’m using the same instance type as before, so I’m not sure what changed. Sometimes when I run the code I also get a different number of “Hail: INFO: Coerced sorted dataset” messages. Any ideas what’s going on? I’m about to try a more powerful instance just to see whether there is some memory or storage issue.
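In the meantime, here is what I am planning to try next, following the hint in the error summary (I am assuming the other arguments can stay exactly as before):
import hail as hl
# Allow missing values in FORMAT arrays, as the VCFParseError suggests,
# by passing array_elements_required=False to import_vcf.
mt = hl.import_vcf(
    'file:////mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157_c3_b33_v1.vcf.gz',
    force_bgz=True,
    reference_genome="GRCh38",
    array_elements_required=False)
mt.write('ukb23157_c3_b33_v1.mt', overwrite=True)
Does passing that flag sound like the right fix here, or does the missing FORMAT value point to a deeper problem with the file?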
-
Dan, thank you also for the vcftools suggestion; yes, I think I should look into it. I’m new to all of this and am not sure what the best practices are for what I’d like to accomplish.
-
Dan, to your point that for larger datasets it would be better to pay the overhead of importing into a Matrix Table: when would you typically say the cost is worth it? I thought that even this UK Biobank pVCF chromosome block (~20 GB) would reach that level, but maybe you don’t think so, perhaps because it is still small enough to work with using vcftools on a single machine?
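(My rough mental model, in case it clarifies the question: if the import costs T_import once, and each later query costs T_mt on the Matrix Table versus T_vcf directly on the VCF, the import pays off after roughly N ≈ T_import / (T_vcf − T_mt) queries. So for a one-off extraction the overhead may never be recovered, while for repeated analyses it should be quickly. Please correct me if that framing is off.)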
-
I am also just now realizing that I may not actually be using Hail on multiple nodes of the cluster via Spark. I just noticed that I initialized Hail using the code in the GWAS tutorial, which is:
import hail as hl
hl.init()
from hail.plot import show
from pprint import pprint
hl.plot.output_notebook()
In fact, it looks like, according to one of Dan’s previous posts, I should be using (at least the first time):
import pyspark
import dxpy
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
spark.sql("CREATE DATABASE my_database LOCATION 'dnax://'")
… is that right?
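If so, my tentative plan, just a sketch of how I think the two pieces fit together (the database name is simply the one from Dan’s example), would be:
import pyspark
import dxpy
import hail as hl
# Create the Spark context/session on the cluster first, then hand the existing
# context to Hail so it runs on the Spark cluster instead of a local backend.
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
spark.sql("CREATE DATABASE my_database LOCATION 'dnax://'")  # only needed the first time, per Dan's post
hl.init(sc=sc)
# Quick sanity checks that Hail is really on the cluster and not in local mode
print(sc.master)              # should not be local[*] if running on the cluster
print(sc.defaultParallelism)  # rough count of cores available across executors
Or does hl.init() on its own already pick up the cluster’s Spark configuration?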
Best,
Jeremy