Entry fields empty after import_vcf from 1000 genomes data


#1

I am trying to learn how to use hail devel by studying the 1000 genomes data, but when I import a vcf the “Entry fields” are empty. I read the documentation for the import_vcf function and can’t figure out what I am doing wrong.

I am running hail on GCP dataproc and I used the very helpful cloudtools functions to start and connect to my cluster.

import hail as hl
hl.init() # version devel-8650fd3cdd20

mt = hl.import_vcf(‘gs://genomics-public-data/1000-genomes-phase-3/vcf-20150220/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf’)

mt.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str 
----------------------------------------
Row fields:
    'locus': locus<GRCh37> 
    'alleles': array<str> 
    'rsid': str 
    'qual': float64 
    'filters': set<str> 
    'info': struct {
        CIEND: array<int32>, 
        CIPOS: array<int32>, 
        CS: str, 
        END: int32, 
        IMPRECISE: bool, 
        MC: array<str>, 
        MEINFO: array<str>, 
        MEND: int32, 
        MLEN: int32, 
        MSTART: int32, 
        SVLEN: array<int32>, 
        SVTYPE: str, 
        TSD: str, 
        AC: array<int32>, 
        AF: array<float64>, 
        NS: int32, 
        AN: int32, 
        EAS_AF: array<float64>, 
        EUR_AF: array<float64>, 
        AFR_AF: array<float64>, 
        AMR_AF: array<float64>, 
        SAS_AF: array<float64>, 
        DP: int32, 
        AA: str, 
        VT: array<str>, 
        EX_TARGET: bool, 
        MULTI_ALLELIC: bool
    } 
----------------------------------------
Entry fields:
    None
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
Partition key: ['locus']
----------------------------------------

#2

this VCF is 4.1; we are usually seeing v4.2 vcfs. Maybe we should throw an error on 4.1.

The primary probem here is that the FORMAT fields aren’t declared in the header!


#3

Thanks Tim for your superfast reply!!

I just checked out the header for the VCF file and I think it does have some FORMAT information in the header. What is the minimum that import_vcf needs?

bcftools view -h ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
##fileformat=VCFv4.1

##FORMAT=<ID=GT,Number=1,Type=String,Description=“Genotype”>

I naively tried to load the file using import_matrix_table and manually specifying the rows, but it did not work.

mt = hl.import_matrix_table('gs://genomics-public-data/1000-genomes-phase-3/vcf-20150220/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf',
                           row_fields={'locus': hl.tlocus,
                                       'alleles': hl.tarray,
                                       'rsid': hl.tstr,
                                       'qual': hl.tfloat64,
                                       'filters': hl.tset,
                                       'info': {
                                           'CIEND': hl.tarray(hl.tint32), 
                                           'CIPOS': hl.tarray(hl.tint32), 
                                           'CS': hl.tstr, 
                                           'END': hl.tint32, 
                                           'IMPRECISE': hl.tbool,
                                           'MC': hl.tarray(hl.tstr),
                                           'MEINFO': hl.tarray(hl.tstr),
                                           'MEND': hl.tint32,  
                                           'MLEN': hl.tint32,
                                           'MSTART': hl.tint32,
                                           'SVLEN': hl.tarray(hl.tint32),
                                           'SVTYPE': hl.tstr,
                                           'TSD': hl.tstr, 
                                           'AC': hl.tarray(hl.tint32), 
                                           'AF': hl.tarray(hl.tfloat), 
                                           'NS': hl.tint32, 
                                           'AN': hl.tint32,
                                           'EAS_AF': hl.tarray(hl.tfloat), 
                                           'EUR_AF': hl.tarray(hl.tfloat), 
                                           'AFR_AF': hl.tarray(hl.tfloat), 
                                           'AMR_AF': hl.tarray(hl.tfloat), 
                                           'SAS_AF': hl.tarray(hl.tfloat), 
                                           'DP': hl.tint32, 
                                           'AA': hl.tstr, 
                                           'VT': hl.tarray(hl.tstr), 
                                           'EX_TARGET': hl.tbool, 
                                           'MULTI_ALLELIC': hl.tbool}
                                      },
                            row_key = ['locus', 'alleles']
                           )

TypeError: import_matrix_table: parameter ‘row_fields’: expected dict[str, (hail.expr.types.HailType or (str))], found dict: {‘locus’:…

Thanks!


#4

It may be possible to import a VCF using import_matrix_table with some more wrangling (you’ll need to set entry_type to a string or call, and the info field to a string (the VCF info field is a bit special and crazy, the import_vcf has a bunch of specialized code).

However, that’s definitely not something I’d recommend! I think this may just be some bug inside Hail or the way we call out to HTSJDK, so I’ve made an issue to track that - https://github.com/hail-is/hail/issues/4160

I’m on vacation the rest of this week but randomly assigned Jon to take a look.


#5

Thanks Tim. Enjoy your vacation! Very thankful for the support you and your team provide!


#6

Our colleague Liam has made 1000 genomes phase 3 matrix tables publicly available at:

https://console.cloud.google.com/storage/browser/hail-datasets/hail-data/

Not sure what’s going on yet, but that should allow you to experiment in the meantime!


#7

So Liam found that the genomics-public-data VCF hosted by Google is missing the line ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> in the header. He created the matrix tables above using the files directly hosted by EBI:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/


#8

can we give Google a fixed VCF?


#9

@jbloom @tpoterba thanks again for looking into this! The amazing support your team provides for hail is one of the many reasons I am excited to learn more about hail and share it with my colleagues (currently transitioning from array data using plink to ngs).


#10

I’ve sent an email to GCP support to see if we can get this fixed for posterity.