General advice for dealing with large unordered vcf files

I have a fairly large vcf (430 MB compressed), and it contains unsorted variants. It was produced as part of a WGS analysis pipeline that I’m only vaguely familiar with. Not sure if this is relevant but it also contains many alternate contig ID’s such as:

##contig=<ID=chr16_GL383557v1_alt,length=89672>
##contig=<ID=chr17_GL383563v3_alt,length=375691>
##contig=<ID=chr17_KI270861v1_alt,length=196688>
##contig=<ID=chr22_GL383583v2_alt,length=96924>
##contig=<ID=chr22_GL383582v2_alt,length=162811>
##contig=<ID=chrX_KI270880v1_alt,length=284869>
##contig=<ID=chrX_KI270881v1_alt,length=144206>
##contig=<ID=chrUn_KI270390v1,length=2387>
##contig=<ID=chrUn_KI270387v1,length=1537>

So far, I’ve successfully imported this vcf into a matrix table object, but I cant seem to run any useful queries on it without crashing the memory. For example, when I try to display the first 5 lines [mt.show(5)], it gives me a long java stack trace error that ends with:

Error summary: IOException: No space left on device

I eventually want to annotate this vcf with gnomAD popmax values, but my inability to even display the first 5 rows is not inspiring confidence. Can anyone give me some advice on how to handle this table without overloading my memory?

We fixed a bug that may have led to unintended “ordering unsorted dataset with network shuffle” behavior – could you try rerunning with 0.2.76?

The shuffle (distributed sort) happens on import, so all queries are gated by that. Once we make that work (or avoid it), downstream work should be pretty painless.

Thanks for your response.

I was able to upgrade the version to 0.2.76, but I still ran into the same error.

At first, I was able to successfully run

mt.rows().show(5)

[it displayed the first 5 lines of my vcf beautifully]

But then I ran this and it gave me the same error I was getting before:

x=mt.rows()
x.select(“info”).show(5)

[very]
[long]
[java]
[stacktrace]
Hail version: 0.2.76-d470e2fea3f0
Error summary: IOException: No space left on device

Each time I compiled, the output indicated that it’s still doing that shuffling behavior you mentioned. Is there any way to disable that? Are there any other ways to work around this issue?

Hmm. Just to be clear your code is exactly:

mt = hl.read_matrix_table(...)
x = mt.rows()
x.select('info').show(5)

with no other intervening operations on mt? You should only experience a shuffle if you perform order changing operations such as key_by, order_by, and group_by (and their column-wise and row-wise equivalents).

I strongly suspect there is either another Hail bug or something in your code inadvertently triggered a shuffle.


You can’t disable shuffling. Hail’s data model fundamentally depends on ordering. We implement SQL-like joins using ordering, not hashing.

The shuffle is happening on import_vcf. The VCF isn’t sorted according to Hail.

import hail as hl
hl.init(spark_conf={‘spark.driver.memory’:‘200g’}) ## maybe this line doesn’t do what I think it does?
rg = hl.get_reference(‘GRCh38’)
vcf = ‘/path/to/raw.vcf.gz’
mt=hl.import_vcf(vcf, reference_genome=rg, force=True)#, force_bgz?
ht=hl.read_table(“gnomad.genomes.v3.1.1.sites.ht”)
ht=ht.filter(~hl.is_missing(ht.popmax.AF)) #remove all popmax=NA
n=mt.annotate_rows(popmax=ht[mt.locus, mt.alleles].popmax.AF)
n=n.filter_rows(n.popmax < 0.01, keep=True)
hl.export_vcf(n, ‘path/to/wgsTester-filtered.vcf.gz’)


[long Java stack]
Error summary: IOException: No space left on device

I found that spark_conf option on a forum somewhere and threw that in, thinking it would expand the memory, but I’m not at all sure that’s whats happening. Am I on the right track with this?

I was also considering trying to parallelize it somehow?

Would repartitioning or increasing n_partitions on the import have any effect?

The “no space left on device” is a disk limitation, not memory – the VCF is sorted using an out-of-core (on-disk) sorting algorithm. This is slow and space-intensive.

We need to figure out why the VCF isn’t sorted, I think. That’s the top priority.

@mrdata , could you possibly post the first, say, 100 loci and variants from your VCF? Can you take a look at the rest of the VCF (e.g. using gzcat & less) and see if there’s some obvious (unordered) pattern to the loci and variants?