Hi,
I am trying to improve the performance of my pipeline, so I am trying the cache option of the MatrixTable. I load a big file of 140 GB into a MatrixTable, and then I query the GT field and show it. I put the cache right after loading the file and expected that the query "mt.GT.show()" would be faster than without cache, but it takes the same time. So, how does cache work on a MatrixTable?
show() is only looking at a tiny amount of data, and may not be as sensitive to file localization. How long is it taking to show() the GT?
Well, for example, a count() currently takes 4 minutes before cache and the same time after cache.
Can you share your Python script?
It's Python 3, in the shell, and very simple:
import hail as hl
hl.init()
data = hl.import_vcf("file.vcf.gz", force_bgz=True, reference_genome='GRCh38')
data.count()
data.cache()
data.count()
cache() / persist() / checkpoint() don’t mutate – they are functional. Try
data = data.cache()
instead.
However, I'd recommend that you try data = data.checkpoint(some_mt_path), which is a wrapper around write/read and works better than cache() in some cluster configurations.
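For example, a minimal sketch of that pattern (the path "file.mt" is just a placeholder for wherever you want the checkpoint written):

import hail as hl
hl.init()

data = hl.import_vcf("file.vcf.gz", force_bgz=True, reference_genome='GRCh38')
# checkpoint() writes the MatrixTable to disk in Hail's native format and reads it back,
# so later queries read the already-parsed data instead of re-parsing the VCF text
data = data.checkpoint("file.mt", overwrite=True)
data.count()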
Hmm. Now I tried data = data.cache(), but when I then run data.count() it first outputs all my data and then raises an error:
is.hail.io.vcf.VCFParseError: missing value in FORMAT array. Import with argument 'array_elements_required=False'
at is.hail.io.vcf.VCFLine.parseError(LoadVCF.scala:60)
at is.hail.io.vcf.VCFLine.parseIntArrayElement(LoadVCF.scala:581)
at is.hail.io.vcf.VCFLine.parseAddFormatArrayInt(LoadVCF.scala:629)
at is.hail.io.vcf.FormatParser.parseAddField(LoadVCF.scala:1029)
at is.hail.io.vcf.FormatParser.parse(LoadVCF.scala:1068)
at is.hail.io.vcf.LoadVCF$.parseLine(LoadVCF.scala:1449)
at is.hail.io.vcf.LoadVCF$.parseLine(LoadVCF.scala:1302)
Set that flag on import_vcf and it should work.
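In your script, that would look roughly like this (same file and settings as in your example):

data = hl.import_vcf("file.vcf.gz",
                     force_bgz=True,
                     reference_genome='GRCh38',
                     array_elements_required=False)
data = data.cache()
data.count()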
I have tried it now, and the count after the cache is very, very slow, much slower than a count without cache. My data is about 140 GB, and I have enough memory, more than 300 GB.
I have also tried checkpoint, but it writes to disk, so it is also very slow.
I only need 2 or at most 3 queries and exports of my data, so this cache/persist/checkpoint step is only worth it if it doesn't take longer than a single query.
Parsing text is very slow. import_vcf() / count() wasn’t actually parsing the genotypes.
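Roughly, the difference looks like this (the aggregation is just an illustrative entry-level query, not something from your pipeline):

# count() only needs the number of rows and columns, so the FORMAT fields
# (GT, AD, ...) are never parsed from the VCF text
data.count()

# an entry-level query forces the FORMAT fields to be parsed for every entry,
# which is where most of the time goes
data.aggregate_entries(hl.agg.counter(data.GT.n_alt_alleles()))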
I see.
What exactly does the 'array_elements_required' argument do? Why can I load the file without this argument, but if I use cache I need to add it?
From the import_vcf docs:

- array_elements_required (bool) – If True, all elements in an array field must be present. Set this parameter to False for Hail to allow array fields with missing values such as 1,.,5. In this case, the second element will be missing. However, in the case of a single missing element ., the entire field will be missing and not an array with one missing element.
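For instance, with hypothetical FORMAT values, and assuming your VCF has an integer array field such as AD:

mt = hl.import_vcf("file.vcf.gz",
                   force_bgz=True,
                   reference_genome='GRCh38',
                   array_elements_required=False)
# AD=1,.,5  ->  [1, NA, 5]   (an array with one missing element)
# AD=.      ->  NA           (the whole field is missing, not a one-element array)
mt.AD.show()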
You can import_vcf / count() without triggering the error because that query doesn't require the genotypes to be parsed; cache() forces them to be parsed, which is why the error only shows up then.