General advice for dealing with large unordered vcf files

mrdata · September 15, 2021, 11:12pm

I have a fairly large vcf (430 MB compressed), and it contains unsorted variants. It was produced as part of a WGS analysis pipeline that I’m only vaguely familiar with. Not sure if this is relevant but it also contains many alternate contig ID’s such as:

##contig=<ID=chr16_GL383557v1_alt,length=89672>
##contig=<ID=chr17_GL383563v3_alt,length=375691>
##contig=<ID=chr17_KI270861v1_alt,length=196688>
##contig=<ID=chr22_GL383583v2_alt,length=96924>
##contig=<ID=chr22_GL383582v2_alt,length=162811>
##contig=<ID=chrX_KI270880v1_alt,length=284869>
##contig=<ID=chrX_KI270881v1_alt,length=144206>
##contig=<ID=chrUn_KI270390v1,length=2387>
##contig=<ID=chrUn_KI270387v1,length=1537>

So far, I’ve successfully imported this vcf into a matrix table object, but I cant seem to run any useful queries on it without crashing the memory. For example, when I try to display the first 5 lines [mt.show(5)], it gives me a long java stack trace error that ends with:

Error summary: IOException: No space left on device

I eventually want to annotate this vcf with gnomAD popmax values, but my inability to even display the first 5 rows is not inspiring confidence. Can anyone give me some advice on how to handle this table without overloading my memory?

tpoterba · September 15, 2021, 11:59pm

We fixed a bug that may have led to unintended “ordering unsorted dataset with network shuffle” behavior – could you try rerunning with 0.2.76?

The shuffle (distributed sort) happens on import, so all queries are gated by that. Once we make that work (or avoid it), downstream work should be pretty painless.

mrdata · September 16, 2021, 4:57am

Thanks for your response.

I was able to upgrade the version to 0.2.76, but I still ran into the same error.

At first, I was able to successfully run

mt.rows().show(5)
…
[it displayed the first 5 lines of my vcf beautifully]

But then I ran this and it gave me the same error I was getting before:

x=mt.rows()
x.select(“info”).show(5)
…
[very]
[long]
[java]
[stacktrace]
Hail version: 0.2.76-d470e2fea3f0
Error summary: IOException: No space left on device

Each time I compiled, the output indicated that it’s still doing that shuffling behavior you mentioned. Is there any way to disable that? Are there any other ways to work around this issue?

danking · September 16, 2021, 4:40pm

Hmm. Just to be clear your code is exactly:

mt = hl.read_matrix_table(...)
x = mt.rows()
x.select('info').show(5)

with no other intervening operations on mt? You should only experience a shuffle if you perform order changing operations such as key_by, order_by, and group_by (and their column-wise and row-wise equivalents).

I strongly suspect there is either another Hail bug or something in your code inadvertently triggered a shuffle.

You can’t disable shuffling. Hail’s data model fundamentally depends on ordering. We implement SQL-like joins using ordering, not hashing.

tpoterba · September 16, 2021, 4:44pm

The shuffle is happening on import_vcf. The VCF isn’t sorted according to Hail.

mrdata · September 16, 2021, 5:35pm

import hail as hl
hl.init(spark_conf={‘spark.driver.memory’:‘200g’}) ## maybe this line doesn’t do what I think it does?
rg = hl.get_reference(‘GRCh38’)
vcf = ‘/path/to/raw.vcf.gz’
mt=hl.import_vcf(vcf, reference_genome=rg, force=True)#, force_bgz?
ht=hl.read_table(“gnomad.genomes.v3.1.1.sites.ht”)
ht=ht.filter(~hl.is_missing(ht.popmax.AF)) #remove all popmax=NA
n=mt.annotate_rows(popmax=ht[mt.locus, mt.alleles].popmax.AF)
n=n.filter_rows(n.popmax < 0.01, keep=True)
hl.export_vcf(n, ‘path/to/wgsTester-filtered.vcf.gz’)
…
…
[long Java stack]
Error summary: IOException: No space left on device

I found that spark_conf option on a forum somewhere and threw that in, thinking it would expand the memory, but I’m not at all sure that’s whats happening. Am I on the right track with this?

I was also considering trying to parallelize it somehow?

Would repartitioning or increasing n_partitions on the import have any effect?

tpoterba · September 16, 2021, 5:37pm

The “no space left on device” is a disk limitation, not memory – the VCF is sorted using an out-of-core (on-disk) sorting algorithm. This is slow and space-intensive.

tpoterba · September 16, 2021, 5:38pm

We need to figure out why the VCF isn’t sorted, I think. That’s the top priority.

danking · September 17, 2021, 2:31pm

@mrdata , could you possibly post the first, say, 100 loci and variants from your VCF? Can you take a look at the rest of the VCF (e.g. using gzcat & less) and see if there’s some obvious (unordered) pattern to the loci and variants?

Topic		Replies	Views
VCF matrix table memory error? Hail Query & hailctl	0	324	September 15, 2021
Export_vcf OutOfMemoryError: Java heap space despite --driver-memory 8g Hail Query & hailctl	26	2855	January 11, 2019
Working with large VCFs (e.g. from UK Biobank) is slow Hail Query & hailctl	12	1913	August 23, 2024
Memory and disk space requirements Hail Query & hailctl	8	689	October 10, 2022
Large scale ingest Hail Query & hailctl	7	694	April 8, 2019

General advice for dealing with large unordered vcf files

Related topics