Unable to import variants from structural variant VCF

Hello, I’m a new Hail user.

We have a large dataset of structural variants we would like to play around with in Hail, but I haven’t been able to import the variants from the VCF:

summary.report()

     Samples: 21704
    Variants: 0
   Call Rate: nan
     Contigs: []

Multiallelics: 0
SNPs: 0
MNPs: 0
Insertions: 0
Deletions: 0
Complex Alleles: 0
Star Alleles: 0
Max Alleles: 2

I’ve tried adding the generic=True keyword arg to HailContext.import_vcf(), but it appears to have been removed (I’m using commit 15f228906b57ca0f479ccee9f135a73bf3127860).

I haven’t had any trouble loading the example VCFs from hail/python/hail/docs/data. Are SVs supported? If so, is there something off-spec about our VCF that’s causing this to fail? Here are the first three lines of three samples from our dataset (identifying info removed):
https://drive.google.com/open?id=1PsufxDzQPjv8it8MTOENgU-ypxvbdSzQ

Thanks!
Adam

Hi Adam,

Just to note, you’re using the development version – it’s pretty unstable at the moment as we gear up for the 0.2 beta version in ~5 weeks or so. We’d recommend people use 0.1 for the time being.

However, 0.1 doesn’t support SVs at all. It supports a narrow range of VCFs, namely those produced by GATK for germline sequencing, and it removes every site from the VCF where the ref/alt alleles don’t match [ATGCN]+ or *.
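
To make that filter concrete, here is a rough sketch in plain Python. It is not Hail's actual code: the pattern is just a literal reading of "[ATGCN]+ or *" from the description above, and the names `ALLELE_RE` and `site_is_kept` are made up for illustration.

```python
import re

# Assumption: a literal reading of "[ATGCN]+ or *" from the post above.
# This is an illustration, not the pattern taken from Hail's source.
ALLELE_RE = re.compile(r"[ATGCN]+|\*")


def site_is_kept(ref, alts):
    """Return True if the ref and every alt allele fully match the pattern."""
    return all(ALLELE_RE.fullmatch(allele) is not None for allele in [ref, *alts])


print(site_is_kept("A", ["T"]))        # ordinary SNP
print(site_is_kept("AT", ["A", "*"]))  # deletion plus a star allele
print(site_is_kept("N", ["<DEL>"]))    # symbolic SV allele
```

Under this reading the first two sites pass and the third does not, because a symbolic allele such as `<DEL>` contains characters outside the pattern. SV VCFs usually carry symbolic or breakend ALT alleles, so a filter like this would drop their sites.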

0.2 / devel is much more general, supporting arbitrary structured matrices of any schema you want. We’ve been having some dev discussions about how to both handle the specific case of germline genomes that has motivated Hail’s development, and the general case of large structured matrices of biological data.

This example is convincing evidence that we need to err on the side of generality. I’ll add a to-do to relax the VCF parser before the beta (~5 weeks), which would mean you’d be able to load the VCF just fine.

We’ll be announcing the beta release on this forum – I hope our dev timeline isn’t an inconvenience!

Also, you couldn’t find ‘generic’ because everything is generic now! The devel docs are very different from the 0.1 docs: https://www.hail.is/docs/devel/

If you have a small public SNV dataset (1kg maybe?) that you can send us to include in our repository resources and test against, that would be great!

Thank you for all of the clarifications! I look forward to future releases.

I don’t believe I have any data I can share publicly at the moment, but I will speak with my supervisor to see if that’s a possibility.

Are there any public datasets that look like yours that you can point to?