Hi,
I am trying to improve the performance of my pipeline, so I am trying the cache option of the MatrixTable. I load a big file of 140 GB into a MatrixTable, and then I query the GT field and show it. I put the cache right after loading the file and expected that the query "mt.GT.show()" would be faster than without cache, but it takes the same time. So, how does cache work on a MatrixTable?
show() is only looking at a tiny amount of data, and may not be as sensitive to file localization. How long is it taking to show() the GT?
Well, for example, a count() currently takes 4 minutes before cache and the same time after cache.
Can you share your Python script?
It's Python 3, in the shell, and very simple:
import hail as hl
hl.init()
data = hl.import_vcf("file.vcf.gz", force_bgz=True, reference_genome='GRCh38')
data.count()
data.cache()
data.count()
cache() / persist() / checkpoint() don’t mutate – they are functional. Try
data = data.cache()
instead.
However, I'd recommend that you try data = data.checkpoint(some_mt_path), which is a wrapper around write/read and works better than cache() in some cluster configurations.
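For example, a minimal sketch of that pattern (the path "file.mt" is just a placeholder for wherever you want the checkpoint written):

import hail as hl
hl.init()

data = hl.import_vcf("file.vcf.gz", force_bgz=True, reference_genome='GRCh38')
# checkpoint() writes the MatrixTable to disk in Hail's native format and reads it back,
# so later queries read the already-parsed data instead of re-parsing the VCF text
data = data.checkpoint("file.mt", overwrite=True)
data.count()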
Hmm. Now I tried data = data.cache(), but when I then run data.count() it first outputs all my data and then raises an error:
is.hail.io.vcf.VCFParseError: missing value in FORMAT array. Import with argument 'array_elements_required=False'
at is.hail.io.vcf.VCFLine.parseError(LoadVCF.scala:60)
at is.hail.io.vcf.VCFLine.parseIntArrayElement(LoadVCF.scala:581)
at is.hail.io.vcf.VCFLine.parseAddFormatArrayInt(LoadVCF.scala:629)
at is.hail.io.vcf.FormatParser.parseAddField(LoadVCF.scala:1029)
at is.hail.io.vcf.FormatParser.parse(LoadVCF.scala:1068)
at is.hail.io.vcf.LoadVCF$.parseLine(LoadVCF.scala:1449)
at is.hail.io.vcf.LoadVCF$.parseLine(LoadVCF.scala:1302)
Set that flag on import_vcf and it should work.
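In your script, that would look roughly like this (same file and settings as in your example):

data = hl.import_vcf("file.vcf.gz",
                     force_bgz=True,
                     reference_genome='GRCh38',
                     array_elements_required=False)
data = data.cache()
data.count()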
I have tried it now, and the count after the cache is very, very slow, much slower than a count without cache. My data is about 140 GB, and I have enough memory, more than 300 GB.
I have also tried checkpoint, but it writes to disk, so it is also very slow.
I only need 2 or at most 3 queries and exports of my data, so this cache/persist/checkpoint step is only worth it if it doesn't take longer than a single query.
Parsing text is very slow. import_vcf() / count() wasn’t actually parsing the genotypes.
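Roughly, the difference looks like this (the aggregation is just an illustrative entry-level query, not something from your pipeline):

# count() only needs the number of rows and columns, so the FORMAT fields
# (GT, AD, ...) are never parsed from the VCF text
data.count()

# an entry-level query forces the FORMAT fields to be parsed for every entry,
# which is where most of the time goes
data.aggregate_entries(hl.agg.counter(data.GT.n_alt_alleles()))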
I see.
What exactly does the 'array_elements_required' argument do? Why can I load the file without this argument, but if I use cache I need to add it?
From the import_vcf docs:

- array_elements_required (bool) – If True, all elements in an array field must be present. Set this parameter to False for Hail to allow array fields with missing values such as 1,.,5. In this case, the second element will be missing. However, in the case of a single missing element ., the entire field will be missing and not an array with one missing element.
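For instance, with hypothetical FORMAT values, and assuming your VCF has an integer array field such as AD:

mt = hl.import_vcf("file.vcf.gz",
                   force_bgz=True,
                   reference_genome='GRCh38',
                   array_elements_required=False)
# AD=1,.,5  ->  [1, NA, 5]   (an array with one missing element)
# AD=.      ->  NA           (the whole field is missing, not a one-element array)
mt.AD.show()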
You can import_vcf / count() without triggering the error because that query doesn't require the genotypes to be parsed; cache() forces them to be parsed, which is why the error only shows up then.