Parsing out complex JSON or dict for variant annotations from Nirvana

The Nirvana support in Hail 0.2.63 no longer seems to work, likely because it relies on STDIN to stream variants into Nirvana, which Nirvana 3.14 no longer supports. I am therefore trying to run my own copy of Nirvana and reduce the complex JSON string it produces to a simpler structure for my annotations…

I am not quite sure how this is done in Hail… In general, how do you take a JSON or dict object and annotate variants (in annotate_rows) with a subset of the information in that object?

If I have this as my JSON/DICT object for a single variant:

sv = {'vid': '11-65918881-G-A',
 'chromosome': 'chr11',
 'begin': 65918881,
 'end': 65918881,
 'refAllele': 'G',
 'altAllele': 'A',
 'variantType': 'SNV',
 'hgvsg': 'NC_000011.10:g.65918881G>A',
 'phylopScore': 0.7,
 'regulatoryRegions': [{'id': 'ENSR00000040907',
   'type': 'promoter',
   'consequence': ['regulatory_region_variant']}],
 'dbsnp': ['rs1207174710'],
 'gnomad': {'coverage': 20,
  'failedFilter': True,
  'allAf': 3.3e-05,
  'allAn': 60266,
  'allAc': 2,
  'allHc': 0,
  'afrAf': 0,
  'afrAn': 9008,
  'afrAc': 0,
  'afrHc': 0,
  'amrAf': 0,
  'amrAn': 6528,
  'amrAc': 0,
  'amrHc': 0,
  'easAf': 0.000525,
  'easAn': 1904,
  'easAc': 1,
  'easHc': 0,
  'finAf': 0,
  'finAn': 5384,
  'finAc': 0,
  'finHc': 0,
  'nfeAf': 4e-05,
  'nfeAn': 25006,
  'nfeAc': 1,
  'nfeHc': 0,
  'asjAf': 0,
  'asjAn': 3014,
  'asjAc': 0,
  'asjHc': 0,
  'sasAf': 0,
  'sasAn': 7432,
  'sasAc': 0,
  'sasHc': 0,
  'othAf': 0,
  'othAn': 1990,
  'othAc': 0,
  'othHc': 0,
  'maleAf': 0,
  'maleAn': 34844,
  'maleAc': 0,
  'maleHc': 0,
  'femaleAf': 7.9e-05,
  'femaleAn': 25422,
  'femaleAc': 2,
  'femaleHc': 0,
  'controlsAllAf': 4.7e-05,
  'controlsAllAn': 21356,
  'controlsAllAc': 1},
 'topmed': {'allAf': 6.4e-05, 'allAn': 125568, 'allAc': 8, 'allHc': 0},
 'transcripts': [{'transcript': 'ENST00000438576.2',
   'source': 'Ensembl',
   'bioType': 'protein_coding',
   'introns': '1/1',
   'geneId': 'ENSG00000175573',
   'hgnc': 'C11orf68',
   'consequence': ['intron_variant'],
   'hgvsc': 'ENST00000438576.2:c.122+29C>T',
   'isCanonical': True,
   'proteinId': 'ENSP00000398350.2'},
  {'transcript': 'NM_001135635.1',
   'source': 'RefSeq',
   'bioType': 'protein_coding',
   'introns': '1/1',
   'geneId': '83638',
   'hgnc': 'C11orf68',
   'consequence': ['intron_variant'],
   'hgvsc': 'NM_001135635.1:c.122+29C>T',
   'isCanonical': True,
   'proteinId': 'NP_001129107.1'},
  {'transcript': 'NM_031450.3',
   'source': 'RefSeq',
   'bioType': 'protein_coding',
   'introns': '1/1',
   'geneId': '83638',
   'hgnc': 'C11orf68',
   'consequence': ['intron_variant'],
   'hgvsc': 'NM_031450.3:c.122+29C>T',
   'proteinId': 'NP_113638.2'},
  {'transcript': 'ENST00000449692.3',
   'source': 'Ensembl',
   'bioType': 'protein_coding',
   'introns': '1/1',
   'geneId': 'ENSG00000175573',
   'hgnc': 'C11orf68',
   'consequence': ['intron_variant'],
   'hgvsc': 'ENST00000449692.3:c.122+29C>T',
   'proteinId': 'ENSP00000409681.3'},
  {'transcript': 'ENST00000530188.1',
   'source': 'Ensembl',
   'bioType': 'protein_coding',
   'geneId': 'ENSG00000175573',
   'hgnc': 'C11orf68',
   'consequence': ['upstream_gene_variant'],
   'proteinId': 'ENSP00000433914.1'},
  {'transcript': 'ENST00000312515.6',
   'source': 'Ensembl',
   'bioType': 'protein_coding',
   'geneId': 'ENSG00000175550',
   'hgnc': 'DRAP1',
   'consequence': ['upstream_gene_variant'],
   'proteinId': 'ENSP00000307850.2'},
  {'transcript': 'NM_006442.3',
   'source': 'RefSeq',
   'bioType': 'protein_coding',
   'geneId': '10589',
   'hgnc': 'DRAP1',
   'consequence': ['upstream_gene_variant'],
   'isCanonical': True,
   'proteinId': 'NP_006433.2'},
  {'transcript': 'ENST00000525501.5',
   'source': 'Ensembl',
   'bioType': 'protein_coding',
   'geneId': 'ENSG00000175550',
   'hgnc': 'DRAP1',
   'consequence': ['upstream_gene_variant'],
   'proteinId': 'ENSP00000437225.1'},
  {'transcript': 'ENST00000376991.6',
   'source': 'Ensembl',
   'bioType': 'protein_coding',
   'geneId': 'ENSG00000175550',
   'hgnc': 'DRAP1',
   'consequence': ['upstream_gene_variant'],
   'isCanonical': True,
   'proteinId': 'ENSP00000366190.2'},
  {'transcript': 'ENST00000531121.5',
   'source': 'Ensembl',
   'bioType': 'retained_intron',
   'geneId': 'ENSG00000175550',
   'hgnc': 'DRAP1',
   'consequence': ['upstream_gene_variant']},
  {'transcript': 'ENST00000527119.5',
   'source': 'Ensembl',
   'bioType': 'protein_coding',
   'geneId': 'ENSG00000175550',
   'hgnc': 'DRAP1',
   'consequence': ['upstream_gene_variant'],
   'proteinId': 'ENSP00000437287.1'},
  {'transcript': 'ENST00000532933.1',
   'source': 'Ensembl',
   'bioType': 'protein_coding',
   'geneId': 'ENSG00000175550',
   'hgnc': 'DRAP1',
   'consequence': ['upstream_gene_variant'],
   'proteinId': 'ENSP00000432445.1'},
  {'transcript': 'ENST00000530791.5',
   'source': 'Ensembl',
   'bioType': 'retained_intron',
   'geneId': 'ENSG00000175550',
   'hgnc': 'DRAP1',
   'consequence': ['upstream_gene_variant']},
  {'transcript': 'ENST00000534333.1',
   'source': 'Ensembl',
   'bioType': 'retained_intron',
   'geneId': 'ENSG00000175550',
   'hgnc': 'DRAP1',
   'consequence': ['upstream_gene_variant']},
  {'transcript': 'ENST00000525190.1',
   'source': 'Ensembl',
   'bioType': 'retained_intron',
   'geneId': 'ENSG00000175550',
   'hgnc': 'DRAP1',
   'consequence': ['upstream_gene_variant']}]
}

and this as my Hail struct definition:

nirvana_schema = '''
struct{
    chromosome: str,
    refAllele: str,
    altAlleles: array<str>,
    variants: array<
        struct{
            vid:str,
            variantType:str
        }
    >
}
'''

I also have the correct locus and alleles defined for this variant:

variant = hl.eval(hl.parse_variant('chr11:65918881:G:A,GCCCTGC',reference_genome='GRCh38'))
variant
Struct(locus=Locus(contig=chr11, position=65918881, reference_genome=GRCh38), alleles=['G', 'A', 'GCCCTGC'])

How do I annotate my matrix table with this limited schema, starting from the large dict/JSON above? Not all variants will have all the fields from the schema, and conversely, not all fields from the variant are represented in the schema (such as samples). Is there a way to filter my dict automatically based on the schema, or do I have to construct my annotations manually?

I am trying to construct a Hail table from the JSON files, but I am not quite sure how to do that…

I tried having Hail parse it automatically from a Pandas dataframe, but that does not work, so I am not sure how to proceed…

Thanks

I think this workflow will get the job done:

  1. Create a text file with the Nirvana annotation results, with one JSON object per line.
  2. Import it with hl.import_table.
  3. Convert each line to a nested structure using the hl.parse_json function.
  4. Extract the locus/alleles from the nested structure, key by that, and annotate back!
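
The four steps above might look roughly like the sketch below. It is not code from this thread: the helper names write_ndjson and annotate_with_nirvana, the NIRVANA_SCHEMA constant, and the row field name nirvana are made up for illustration. It assumes position-level Nirvana records with chromosome, position, refAllele and altAlleles fields, no header line, and a matrix table whose alleles are represented the same way as in the Nirvana output.

```python
import json

# Reduced schema; list only the fields you want to keep.
NIRVANA_SCHEMA = """
struct{
    chromosome: str,
    position: int32,
    refAllele: str,
    altAlleles: array<str>,
    variants: array<struct{vid: str, hgvsg: str}>
}
"""


def write_ndjson(records, path):
    """Step 1: write one JSON object per line."""
    with open(path, "w") as out:
        for record in records:
            # json.dumps without indent yields a single line and escapes
            # tabs/newlines inside strings, so the default tab delimiter of
            # import_table never splits a record into several columns.
            out.write(json.dumps(record) + "\n")


def annotate_with_nirvana(mt, ndjson_path, reference_genome="GRCh38"):
    """Steps 2-4: import, parse, key by locus/alleles, annotate rows."""
    # Imported here so write_ndjson can be used without Hail installed.
    import hail as hl

    # Step 2: with no_header=True the single text column is named f0.
    ht = hl.import_table(ndjson_path, no_header=True)

    # Step 3: turn each JSON string into a nested struct.
    ht = ht.select(nirvana=hl.parse_json(ht.f0, dtype=NIRVANA_SCHEMA))

    # Step 4: build locus and alleles ([ref, alt1, alt2, ...]) and key by them.
    ht = ht.key_by(
        locus=hl.locus(
            ht.nirvana.chromosome,
            ht.nirvana.position,
            reference_genome=reference_genome,
        ),
        alleles=hl.array([ht.nirvana.refAllele]).extend(ht.nirvana.altAlleles),
    )

    # Join back onto the matrix table rows; unmatched rows get a missing value.
    return mt.annotate_rows(nirvana=ht[mt.locus, mt.alleles])
```

If the matrix table has been split into biallelic rows, the alleles built here will not match multi-allelic Nirvana positions, so the key would need to be constructed per alternate allele instead.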

Thanks for the tips, and I was able to muddle through it…

I put together a simple Jupyter Notebook on what is involved in annotating a Hail Matrix table with (ND)JSON fields…

I did run into one issue, for which I could not directly find a solution…

Since I deliberately did not include ALL of the fields from the JSON file in my Hail struct definition, I got a lot of warnings about fields that are present in the JSON but missing from the struct…

Is there a way to suppress those warnings? Or is there a different way to extract a subset of fields from the JSON structs?

This is what I keep seeing:

2021-03-08 18:40:42 Hail: INFO: Coerced sorted dataset
2021-03-08 18:40:42 Hail: WARN: struct{vid: str, hgvsg: str} has no field phylopScore at <root>.variants[element] for value JDouble(-0.4)
2021-03-08 18:40:42 Hail: WARN: struct{vid: str, hgvsg: str} has no field phylopScore at <root>.variants[element] for value JDouble(0.1)
2021-03-08 18:40:42 Hail: INFO: Coerced sorted dataset
2021-03-08 18:40:43 Hail: WARN: struct{vid: str, hgvsg: str} has no field phylopScore at <root>.variants[element] for value JDouble(-0.4)
2021-03-08 18:40:43 Hail: WARN: struct{vid: str, hgvsg: str} has no field phylopScore at <root>.variants[element] for value JDouble(0.1)
2021-03-08 18:40:43 Hail: WARN: struct{vid: str, hgvsg: str} has no field phylopScore at <root>.variants[element] for value JDouble(-0.4)
2021-03-08 18:40:43 Hail: WARN: struct{vid: str, hgvsg: str} has no field phylopScore at <root>.variants[element] for value JDouble(0.1)

This is the original JSON string

ndjson = '''{"chromosome": "chr11", "position": 65918810, "refAllele": "C", "altAlleles": ["G", "A"],"variants": [{"vid": "11-65918810-C-G", "hgvsg": "NC_000011.10:g.65918810C>G", "phylopScore": -0.4}, { "vid": "11-65918810-C-A", "hgvsg": "NC_000011.10:g.65918810C>A", "phylopScore": -0.8}]} 
{"chromosome": "chr11", "position": 65918812, "refAllele": "G", "altAlleles": ["A"], "variants": [{"vid": "11-65918812-G-A", "phylopScore": 0.1}]}
'''

And this was the struct definition:

nirvana_schema = '''
struct{
    chromosome: str,
    position: int32,
    refAllele: str,
    altAlleles: array<str>,
    cytogeneticBand:str,
    variants:array<struct{
        vid:str,
        hgvsg:str
    }>
}
'''

I had loaded the JSON strings into a hail table and then used this to extract the JSON struct as an expression:

json_expr = hl.parse_json(ht.f0,dtype=nirvana_schema)

More info is in the notebook in the link… Hopefully there is an easy way to avoid the warnings, since I don’t want to have to describe the complete JSON string: it is far more complex than the toy example I provided.

Glad you got it working!

re: warnings - there’s not a way to suppress them right now, but we recently (0.2.63 I think) fixed them to only warn once per Spark task, rather than once per row.
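
Until that changes, one possible workaround (not something proposed in this thread, just a plain-Python sketch) is to drop the unwanted keys from each JSON line before it reaches hl.parse_json, so the parser should have no unexpected fields to warn about. The KEEP whitelist and the prune / prune_ndjson_line helpers are hypothetical names; KEEP has to be kept in sync with the Hail struct definition by hand.

```python
import json

# Hypothetical whitelist mirroring the struct definition above.
# True keeps a value unchanged; a nested dict lists the fields to keep
# inside a struct or inside every element of an array of structs.
KEEP = {
    "chromosome": True,
    "position": True,
    "refAllele": True,
    "altAlleles": True,
    "cytogeneticBand": True,
    "variants": {"vid": True, "hgvsg": True},
}


def prune(value, keep):
    """Recursively drop every dict key that is not listed in keep."""
    if keep is True:
        return value
    if isinstance(value, list):
        # Apply the same whitelist to each array element.
        return [prune(item, keep) for item in value]
    if isinstance(value, dict):
        return {k: prune(v, keep[k]) for k, v in value.items() if k in keep}
    return value


def prune_ndjson_line(line, keep=KEEP):
    """Parse one NDJSON line, prune it, and serialise it back to one line."""
    return json.dumps(prune(json.loads(line), keep))
```

Applied to the second record of the toy example, this removes phylopScore and leaves the other fields untouched. Fields that a record does not contain (such as hgvsg there) are simply absent from the output, exactly as they were in the input.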


I am using 0.2.63 now, and it shows those warnings per row, so maybe a new version will have this?

ah! We fixed this for hl.vep, not hl.parse_json. We can totally fix that too.


should go into 0.2.64:
