Hi, so I managed to run vep
on my original vcf.bgz file, save a mt_vep for analysis but I’d like to save a vep.vcf.bgz file as well. So I tried this first:
mt_vep = hl.read_matrix_table('gs://...vep.mt')
meta_orig=hl.get_vcf_metadata('gs://...orig.vcf.bgz')
hl.export_vcf(mt_vep, 'gs://...vep.vcf.bgz', parallel='header_per_shard', metadata=meta_orig)
Then I saw this warning:
2020-05-12 19:45:06 Hail: WARN: export_vcf: ignored the following fields:
'vep' (row)
I obviously missed something in https://hail.is/docs/0.2/methods/impex.html#hail.methods.export_vcf , but what exactly please?
Thanks, Alan
So I think the relevant bit from the docs is:
Hail exports the fields of struct info as INFO fields, the elements of set<str>
filters as FILTERS, the value of str rsid as ID, and the value of float64 qual as QUAL. No other row fields are exported.
I’m more on the engineering side than genetics side, so I don’t really use VEP, but I think the answer here is that anything you want to export from VEP has to go into a row annotation called info
.
Thanks @johnc1231 . I found this as well: https://discuss.hail.is/t/vcf-ignoring-vep-and-non-reference-samples-while-exporting
I can live with that. My main target is to get the data in a way I can query, like in Amazon Athena.
TomK
May 12, 2021, 7:47pm
4
Hi @alanwilter , did you find a solution to that?
Thanks,
Tom
If you’re trying to export the VEP CSQ field in VCF info, you can use csq=True
on hl.vep
TomK
May 12, 2021, 8:38pm
6
Hi @tpoterba ,
Thank you for your answer.
So my steps are:
>>> import hail as hl
>>> hl.init(default_reference='GRCh38')
...
>>> ds = hl.import_vcf(file, force_bgz=True, contig_recoding=contig_recoding, skip_invalid_loci=True, min_partitions=500)
...
>>> ds = hl.vep(ds, vep_config, csq=True)
>>> ds = ds.annotate_rows(info=ds.info.annotate(vep=ds.vep))
>>> hl.export_vcf(ds, 'clinvar_20210424_GRCh38_vep.vcf.bgz')
I’m getting: is.hail.utils.HailException: VEP returned invalid variant
Any idea on how to debug that?
Thanks,
Tom
What’s the full error message?
TomK
May 12, 2021, 9:05pm
8
It’s pretty long I’m not sure I should paste it on here.
Basically I’m getting a java.lang.ArrayIndexOutOfBoundsException: 1
My understanding is that with csq=True the vep field should be of type tstr, but I have this:
>>> ds.head(1).describe()
Row fields:
'locus': locus<GRCh38>
'alleles': array<str>
'rsid': str
'qual': float64
'filters': set<str>
'info': struct {
...
}
**'vep': array<str>**
'vep_proc_id': struct {
part_idx: int32,
block_idx: int32
}
vep is of type array<str> instead.
Let me know if you want me to paste the full output on here.
Thanks,
Tom
– EDIT –
I believe this is due to my vep_config, using --json instead of --vcf when csq=True.
Thanks,
Tom
We need to see this, I think. Pasting the java stack trace (don’t care about the python one) here or in a github gist are good options.
The array<str>
field exports just fine to VCF, I believe.