Export_vcf with VEP

Hi, so I managed to run vep on my original vcf.bgz file, save a mt_vep for analysis but I’d like to save a vep.vcf.bgz file as well. So I tried this first:

mt_vep = hl.read_matrix_table('gs://...vep.mt')


hl.export_vcf(mt_vep, 'gs://...vep.vcf.bgz', parallel='header_per_shard', metadata=meta_orig)

Then I saw this warning:

2020-05-12 19:45:06 Hail: WARN: export_vcf: ignored the following fields:
    'vep' (row)

I obviously missed something in https://hail.is/docs/0.2/methods/impex.html#hail.methods.export_vcf, but what exactly please?

Thanks, Alan

So I think the relevant bit from the docs is:

Hail exports the fields of struct info as INFO fields, the elements of set<str> filters as FILTERS, the value of str rsid as ID, and the value of float64 qual as QUAL. No other row fields are exported.

I’m more on the engineering side than genetics side, so I don’t really use VEP, but I think the answer here is that anything you want to export from VEP has to go into a row annotation called info.

Thanks @johnc1231. I found this as well: https://discuss.hail.is/t/vcf-ignoring-vep-and-non-reference-samples-while-exporting

I can live with that. My main target is to get the data in a way I can query, like in Amazon Athena.

Hi @alanwilter, did you find a solution to that?



If you’re trying to export the VEP CSQ field in VCF info, you can use csq=True on hl.vep

Hi @tpoterba,

Thank you for your answer.

So my steps are:

>>> import hail as hl
>>> hl.init(default_reference='GRCh38')
>>> ds = hl.import_vcf(file, force_bgz=True, contig_recoding=contig_recoding, skip_invalid_loci=True, min_partitions=500)
>>> ds = hl.vep(ds, vep_config, csq=True)
>>> ds = ds.annotate_rows(info=ds.info.annotate(vep=ds.vep))
>>> hl.export_vcf(ds, 'clinvar_20210424_GRCh38_vep.vcf.bgz')

I’m getting: is.hail.utils.HailException: VEP returned invalid variant

Any idea on how to debug that?



What’s the full error message?

It’s pretty long I’m not sure I should paste it on here.

Basically I’m getting a java.lang.ArrayIndexOutOfBoundsException: 1

My understanding is that with csq=True the vep field should be of type tstr, but I have this:

>>> ds.head(1).describe()
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
    **'vep': array<str>**
    'vep_proc_id': struct {
        part_idx: int32, 
        block_idx: int32

vep is of type array<str> instead.

Let me know if you want me to paste the full output on here.



– EDIT –

I believe this is due to my vep_config, using --json instead of --vcf when csq=True.



We need to see this, I think. Pasting the java stack trace (don’t care about the python one) here or in a github gist are good options.

The array<str> field exports just fine to VCF, I believe.