Entry fields empty after import_vcf from 1000 genomes data

I am trying to learn how to use hail devel by studying the 1000 genomes data, but when I import a vcf the “Entry fields” are empty. I read the documentation for the import_vcf function and can’t figure out what I am doing wrong.

I am running hail on GCP dataproc and I used the very helpful cloudtools functions to start and connect to my cluster.

import hail as hl
hl.init() # version devel-8650fd3cdd20

mt = hl.import_vcf(‘gs://genomics-public-data/1000-genomes-phase-3/vcf-20150220/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf’)

mt.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str 
----------------------------------------
Row fields:
    'locus': locus<GRCh37> 
    'alleles': array<str> 
    'rsid': str 
    'qual': float64 
    'filters': set<str> 
    'info': struct {
        CIEND: array<int32>, 
        CIPOS: array<int32>, 
        CS: str, 
        END: int32, 
        IMPRECISE: bool, 
        MC: array<str>, 
        MEINFO: array<str>, 
        MEND: int32, 
        MLEN: int32, 
        MSTART: int32, 
        SVLEN: array<int32>, 
        SVTYPE: str, 
        TSD: str, 
        AC: array<int32>, 
        AF: array<float64>, 
        NS: int32, 
        AN: int32, 
        EAS_AF: array<float64>, 
        EUR_AF: array<float64>, 
        AFR_AF: array<float64>, 
        AMR_AF: array<float64>, 
        SAS_AF: array<float64>, 
        DP: int32, 
        AA: str, 
        VT: array<str>, 
        EX_TARGET: bool, 
        MULTI_ALLELIC: bool
    } 
----------------------------------------
Entry fields:
    None
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
Partition key: ['locus']
----------------------------------------

this VCF is 4.1; we are usually seeing v4.2 vcfs. Maybe we should throw an error on 4.1.

The primary probem here is that the FORMAT fields aren’t declared in the header!

Thanks Tim for your superfast reply!!

I just checked out the header for the VCF file and I think it does have some FORMAT information in the header. What is the minimum that import_vcf needs?

bcftools view -h ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
##fileformat=VCFv4.1

##FORMAT=<ID=GT,Number=1,Type=String,Description=“Genotype”>

I naively tried to load the file using import_matrix_table and manually specifying the rows, but it did not work.

mt = hl.import_matrix_table('gs://genomics-public-data/1000-genomes-phase-3/vcf-20150220/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf',
                           row_fields={'locus': hl.tlocus,
                                       'alleles': hl.tarray,
                                       'rsid': hl.tstr,
                                       'qual': hl.tfloat64,
                                       'filters': hl.tset,
                                       'info': {
                                           'CIEND': hl.tarray(hl.tint32), 
                                           'CIPOS': hl.tarray(hl.tint32), 
                                           'CS': hl.tstr, 
                                           'END': hl.tint32, 
                                           'IMPRECISE': hl.tbool,
                                           'MC': hl.tarray(hl.tstr),
                                           'MEINFO': hl.tarray(hl.tstr),
                                           'MEND': hl.tint32,  
                                           'MLEN': hl.tint32,
                                           'MSTART': hl.tint32,
                                           'SVLEN': hl.tarray(hl.tint32),
                                           'SVTYPE': hl.tstr,
                                           'TSD': hl.tstr, 
                                           'AC': hl.tarray(hl.tint32), 
                                           'AF': hl.tarray(hl.tfloat), 
                                           'NS': hl.tint32, 
                                           'AN': hl.tint32,
                                           'EAS_AF': hl.tarray(hl.tfloat), 
                                           'EUR_AF': hl.tarray(hl.tfloat), 
                                           'AFR_AF': hl.tarray(hl.tfloat), 
                                           'AMR_AF': hl.tarray(hl.tfloat), 
                                           'SAS_AF': hl.tarray(hl.tfloat), 
                                           'DP': hl.tint32, 
                                           'AA': hl.tstr, 
                                           'VT': hl.tarray(hl.tstr), 
                                           'EX_TARGET': hl.tbool, 
                                           'MULTI_ALLELIC': hl.tbool}
                                      },
                            row_key = ['locus', 'alleles']
                           )

TypeError: import_matrix_table: parameter ‘row_fields’: expected dict[str, (hail.expr.types.HailType or (str))], found dict: {‘locus’:…

Thanks!

It may be possible to import a VCF using import_matrix_table with some more wrangling (you’ll need to set entry_type to a string or call, and the info field to a string (the VCF info field is a bit special and crazy, the import_vcf has a bunch of specialized code).

However, that’s definitely not something I’d recommend! I think this may just be some bug inside Hail or the way we call out to HTSJDK, so I’ve made an issue to track that - https://github.com/hail-is/hail/issues/4160

I’m on vacation the rest of this week but randomly assigned Jon to take a look.

Thanks Tim. Enjoy your vacation! Very thankful for the support you and your team provide!

Our colleague Liam has made 1000 genomes phase 3 matrix tables publicly available at:

https://console.cloud.google.com/storage/browser/hail-datasets/hail-data/

Not sure what’s going on yet, but that should allow you to experiment in the meantime!

1 Like

So Liam found that the genomics-public-data VCF hosted by Google is missing the line ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> in the header. He created the matrix tables above using the files directly hosted by EBI:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

can we give Google a fixed VCF?

@jbloom @tpoterba thanks again for looking into this! The amazing support your team provides for hail is one of the many reasons I am excited to learn more about hail and share it with my colleagues (currently transitioning from array data using plink to ngs).

I’ve sent an email to GCP support to see if we can get this fixed for posterity.