import_vcf failure on multiple inputs

For the failing chromosomes, as in previous attempts, I can import each of the VCFs individually (i.e., as a list of one element). We could therefore potentially work around the problem by importing the files one at a time and unioning the results, but we’ve seen that unioning scales quadratically, so it probably won’t be tractable.

I then tried increasing the number of VCFs per chromosome import to 2, taking a random sample from the VCFs available for that chromosome. This failed for Chr1, 15, 22, X and Y in the first run. When I tried again (with a different subset, because the sampling is random), Chr1 and 15 went through OK, while the other three still failed. After multiple trials, every chromosome has succeeded for at least one random subset of size 2.
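Since the subsets are random, it may help to seed the sampling and record exactly which files went into each attempt, so that a failing pair can be re-run later. A rough sketch of that idea: `try_random_subsets` and the `try_import` callable are hypothetical names for illustration, and `try_import` is assumed to wrap whatever import-and-count step is being tested, raising on failure.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple


def try_random_subsets(
    vcfs_by_chrom: Dict[str, Sequence[str]],
    try_import: Callable[[List[str]], None],
    k: int = 2,
    seed: int = 0,
) -> Dict[str, Tuple[List[str], bool]]:
    """For each chromosome, sample k VCFs and record the subset and whether it imported."""
    rng = random.Random(seed)  # seeded, so the same subsets are drawn on a re-run
    results: Dict[str, Tuple[List[str], bool]] = {}
    for chrom in sorted(vcfs_by_chrom):
        paths = sorted(vcfs_by_chrom[chrom])
        subset = rng.sample(paths, min(k, len(paths)))
        try:
            try_import(subset)  # assumed to raise if the import (or count) fails
            ok = True
        except Exception:
            ok = False
        results[chrom] = (subset, ok)
    return results
```

Changing `seed` between runs gives a different draw, while keeping the recorded subsets makes it possible to retry exactly the pairs that failed.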

This suggests to me that there is something wrong with particular VCF files, regardless of the chromosome. However, the fact that they import individually just fine is super-weird.

“the fact that they import individually just fine”

To clarify for my own edification: do you mean that you can run something like import_vcf followed by count() just fine? Or does only the import stage work?

If something is preventing worker nodes from reading the data in those files, then it’s possible that import_vcf(single_file) will work but the count() will fail.
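One way to tell those two cases apart is to run the import and the action as separate steps and note which one raises. A minimal sketch, assuming Hail 0.2-style calls; `check_stages` and its two callable parameters are hypothetical names, not part of Hail:

```python
from typing import Any, Callable, Optional, Tuple


def check_stages(
    path: str,
    import_fn: Callable[[str], Any],
    count_fn: Callable[[Any], Any],
) -> Tuple[str, Optional[str]]:
    """Return ("ok", None), or the name of the stage that raised plus the error text."""
    try:
        mt = import_fn(path)  # build the dataset object for this file
    except Exception as exc:
        return ("import", repr(exc))
    try:
        count_fn(mt)  # an action such as count() forces the data to be read
    except Exception as exc:
        return ("count", repr(exc))
    return ("ok", None)


# With Hail this might be called as follows (assumed usage, not run here):
#   import hail as hl
#   check_stages("some.vcf.bgz", hl.import_vcf, lambda mt: mt.count())
```

A result of `"count"` for a file that imports cleanly would point towards the workers being unable to read the data rather than a problem with the import call itself.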

I’m testing with both import_vcf and then count, to make sure. The guts of my script are effectively:

mt = hail.import_vcf(vcfs)
print(mt.count())

This works fine when vcfs is a list of length 1, regardless of the particular file (or chromosome, interval, etc.), but as soon as that list becomes larger, the import stage will fail in (so far as I have been able to establish) non-deterministic ways. It may consistently fail for the same files: I haven’t tried that yet, because it works with each file individually :confused:
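To check whether the failures follow particular files, one option is to try every pair several times and count how often each file shows up in a failing pair. This is only a sketch under assumptions: `pair_failures`, `files_in_failures` and the `try_import` callable are hypothetical names, `try_import` is assumed to wrap the import-and-count step and raise on failure, and the number of pairs grows quadratically, so it is only practical for a modest number of files.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple


def pair_failures(
    vcfs: Sequence[str],
    try_import: Callable[[List[str]], None],
    repeats: int = 3,
) -> Dict[Tuple[str, str], int]:
    """Try each pair `repeats` times; map each failing pair to its failure count."""
    failures: Dict[Tuple[str, str], int] = {}
    for pair in combinations(sorted(vcfs), 2):
        n_failed = 0
        for _ in range(repeats):
            try:
                try_import(list(pair))  # assumed to raise when the import fails
            except Exception:
                n_failed += 1
        if n_failed:
            failures[pair] = n_failed
    return failures


def files_in_failures(failures: Dict[Tuple[str, str], int]) -> Counter:
    """Count how many failing pairs each file appears in."""
    counts: Counter = Counter()
    for pair in failures:
        counts.update(pair)
    return counts
```

A pair that fails on some repeats but not others would suggest genuine non-determinism, whereas a file that appears in most failing pairs would suggest a problem with that file.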