The Nirvana support in Hail 0.2.63 no longer seems to work, likely because it streams variants into Nirvana over STDIN, which Nirvana 3.14 no longer supports. I am therefore running my own copy of Nirvana and want to reduce the complex JSON it produces to a simpler structure for my annotations.
I am not quite sure how this is done in Hail. In general, how do you take a JSON or dict object and annotate variants (in annotate_rows) with a subset of the information in that object?
If I have this as my JSON/dict object for a single variant:
sv = {'vid': '11-65918881-G-A',
'chromosome': 'chr11',
'begin': 65918881,
'end': 65918881,
'refAllele': 'G',
'altAllele': 'A',
'variantType': 'SNV',
'hgvsg': 'NC_000011.10:g.65918881G>A',
'phylopScore': 0.7,
'regulatoryRegions': [{'id': 'ENSR00000040907',
'type': 'promoter',
'consequence': ['regulatory_region_variant']}],
'dbsnp': ['rs1207174710'],
'gnomad': {'coverage': 20,
'failedFilter': True,
'allAf': 3.3e-05,
'allAn': 60266,
'allAc': 2,
'allHc': 0,
'afrAf': 0,
'afrAn': 9008,
'afrAc': 0,
'afrHc': 0,
'amrAf': 0,
'amrAn': 6528,
'amrAc': 0,
'amrHc': 0,
'easAf': 0.000525,
'easAn': 1904,
'easAc': 1,
'easHc': 0,
'finAf': 0,
'finAn': 5384,
'finAc': 0,
'finHc': 0,
'nfeAf': 4e-05,
'nfeAn': 25006,
'nfeAc': 1,
'nfeHc': 0,
'asjAf': 0,
'asjAn': 3014,
'asjAc': 0,
'asjHc': 0,
'sasAf': 0,
'sasAn': 7432,
'sasAc': 0,
'sasHc': 0,
'othAf': 0,
'othAn': 1990,
'othAc': 0,
'othHc': 0,
'maleAf': 0,
'maleAn': 34844,
'maleAc': 0,
'maleHc': 0,
'femaleAf': 7.9e-05,
'femaleAn': 25422,
'femaleAc': 2,
'femaleHc': 0,
'controlsAllAf': 4.7e-05,
'controlsAllAn': 21356,
'controlsAllAc': 1},
'topmed': {'allAf': 6.4e-05, 'allAn': 125568, 'allAc': 8, 'allHc': 0},
'transcripts': [{'transcript': 'ENST00000438576.2',
'source': 'Ensembl',
'bioType': 'protein_coding',
'introns': '1/1',
'geneId': 'ENSG00000175573',
'hgnc': 'C11orf68',
'consequence': ['intron_variant'],
'hgvsc': 'ENST00000438576.2:c.122+29C>T',
'isCanonical': True,
'proteinId': 'ENSP00000398350.2'},
{'transcript': 'NM_001135635.1',
'source': 'RefSeq',
'bioType': 'protein_coding',
'introns': '1/1',
'geneId': '83638',
'hgnc': 'C11orf68',
'consequence': ['intron_variant'],
'hgvsc': 'NM_001135635.1:c.122+29C>T',
'isCanonical': True,
'proteinId': 'NP_001129107.1'},
{'transcript': 'NM_031450.3',
'source': 'RefSeq',
'bioType': 'protein_coding',
'introns': '1/1',
'geneId': '83638',
'hgnc': 'C11orf68',
'consequence': ['intron_variant'],
'hgvsc': 'NM_031450.3:c.122+29C>T',
'proteinId': 'NP_113638.2'},
{'transcript': 'ENST00000449692.3',
'source': 'Ensembl',
'bioType': 'protein_coding',
'introns': '1/1',
'geneId': 'ENSG00000175573',
'hgnc': 'C11orf68',
'consequence': ['intron_variant'],
'hgvsc': 'ENST00000449692.3:c.122+29C>T',
'proteinId': 'ENSP00000409681.3'},
{'transcript': 'ENST00000530188.1',
'source': 'Ensembl',
'bioType': 'protein_coding',
'geneId': 'ENSG00000175573',
'hgnc': 'C11orf68',
'consequence': ['upstream_gene_variant'],
'proteinId': 'ENSP00000433914.1'},
{'transcript': 'ENST00000312515.6',
'source': 'Ensembl',
'bioType': 'protein_coding',
'geneId': 'ENSG00000175550',
'hgnc': 'DRAP1',
'consequence': ['upstream_gene_variant'],
'proteinId': 'ENSP00000307850.2'},
{'transcript': 'NM_006442.3',
'source': 'RefSeq',
'bioType': 'protein_coding',
'geneId': '10589',
'hgnc': 'DRAP1',
'consequence': ['upstream_gene_variant'],
'isCanonical': True,
'proteinId': 'NP_006433.2'},
{'transcript': 'ENST00000525501.5',
'source': 'Ensembl',
'bioType': 'protein_coding',
'geneId': 'ENSG00000175550',
'hgnc': 'DRAP1',
'consequence': ['upstream_gene_variant'],
'proteinId': 'ENSP00000437225.1'},
{'transcript': 'ENST00000376991.6',
'source': 'Ensembl',
'bioType': 'protein_coding',
'geneId': 'ENSG00000175550',
'hgnc': 'DRAP1',
'consequence': ['upstream_gene_variant'],
'isCanonical': True,
'proteinId': 'ENSP00000366190.2'},
{'transcript': 'ENST00000531121.5',
'source': 'Ensembl',
'bioType': 'retained_intron',
'geneId': 'ENSG00000175550',
'hgnc': 'DRAP1',
'consequence': ['upstream_gene_variant']},
{'transcript': 'ENST00000527119.5',
'source': 'Ensembl',
'bioType': 'protein_coding',
'geneId': 'ENSG00000175550',
'hgnc': 'DRAP1',
'consequence': ['upstream_gene_variant'],
'proteinId': 'ENSP00000437287.1'},
{'transcript': 'ENST00000532933.1',
'source': 'Ensembl',
'bioType': 'protein_coding',
'geneId': 'ENSG00000175550',
'hgnc': 'DRAP1',
'consequence': ['upstream_gene_variant'],
'proteinId': 'ENSP00000432445.1'},
{'transcript': 'ENST00000530791.5',
'source': 'Ensembl',
'bioType': 'retained_intron',
'geneId': 'ENSG00000175550',
'hgnc': 'DRAP1',
'consequence': ['upstream_gene_variant']},
{'transcript': 'ENST00000534333.1',
'source': 'Ensembl',
'bioType': 'retained_intron',
'geneId': 'ENSG00000175550',
'hgnc': 'DRAP1',
'consequence': ['upstream_gene_variant']},
{'transcript': 'ENST00000525190.1',
'source': 'Ensembl',
'bioType': 'retained_intron',
'geneId': 'ENSG00000175550',
'hgnc': 'DRAP1',
'consequence': ['upstream_gene_variant']}]
}
and this as my Hail struct definition:
nirvana_schema = '''
struct{
chromosome: str,
refAllele: str,
altAlleles: array<str>,
variants: array<
struct{
vid:str,
variantType:str
}
>
}
'''
And I have the correct locus and alleles for this variant defined as well:
variant = hl.eval(hl.parse_variant('chr11:65918881:G:A,GCCCTGC',reference_genome='GRCh38'))
variant
Struct(locus=Locus(contig=chr11, position=65918881, reference_genome=GRCh38), alleles=['G', 'A', 'GCCCTGC'])
How do I annotate my matrix table with this limited schema, starting from the large dict/JSON I have? Not all variants will have all the fields in the schema, and conversely, not all fields of the variant are represented in the schema (such as samples). Is there a way to filter my dict automatically based on the schema, or do I have to construct my annotations manually?
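To make the question concrete, this is roughly the filtering I have in mind, written as a plain-Python sketch rather than anything Hail provides. The `project` helper and the nested-dict `spec` are names made up for illustration, and the small `example` dict is a trimmed copy of the variant above; a real solution would presumably derive the spec from the Hail struct type instead.

```python
def project(value, spec):
    """Keep only the parts of `value` that are named in `spec`.

    spec is None         -> leaf: keep the value unchanged
    spec is a dict       -> value is a dict: keep the listed keys,
                            filling keys the variant lacks with None
    spec is a 1-item list -> value is a list: apply spec[0] to each element
    """
    if value is None or spec is None:
        return value
    if isinstance(spec, list):
        return [project(item, spec[0]) for item in value]
    return {key: project(value.get(key), sub) for key, sub in spec.items()}


# Hypothetical spec: the subset of fields I would like to keep.
spec = {
    "vid": None,
    "variantType": None,
    "gnomad": {"allAf": None},
    "transcripts": [{"transcript": None, "hgnc": None, "isCanonical": None}],
    "clinvar": [{"id": None}],  # not present in this variant
}

# Trimmed copy of the variant dict shown above.
example = {
    "vid": "11-65918881-G-A",
    "variantType": "SNV",
    "phylopScore": 0.7,
    "gnomad": {"coverage": 20, "allAf": 3.3e-05},
    "transcripts": [
        {"transcript": "ENST00000438576.2", "source": "Ensembl",
         "hgnc": "C11orf68", "isCanonical": True},
        {"transcript": "NM_031450.3", "source": "RefSeq",
         "hgnc": "C11orf68"},
    ],
}

slim = project(example, spec)
```

Here fields outside the spec (such as `phylopScore` and `source`) are dropped, and fields the variant lacks (such as `clinvar`, or `isCanonical` on the second transcript) come back as `None`, which is how I would expect missing values to behave in the annotation.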
I am trying to construct a Hail table from the JSON files, but I am not sure how to go about it.
I tried having Hail infer the table automatically from a Pandas dataframe, but that does not work, so I am not sure how to proceed.
Thanks