Annotation types in annotate_rows_db() with ClinVar

Hello Hail Team,

I was trying to find carriers of pathogenic variants described in ClinVar, but when using annotate_rows_db() I couldn’t get any SNV-level annotation. After annotating, the only annotation types I got were: ‘deletion’, ‘indel’, ‘copy number gain’, ‘copy number loss’, ‘duplication’ and ‘insertion’.

I know that there are pathogenic variants in this cohort: I confirmed it by semi-joining the cohort with ClinVar’s VCF file.

Why are SNV-level variants not annotated? Is this the expected behavior, or am I doing something wrong?

I’ve added the code I’m using below.

Thank you for the support,
Rodrigo Barreiro

import hail as hl
hl.init()

mt = hl.read_matrix_table("s3://../my_cohort.mt")

# Annotation database hosted on AWS, US region
db = hl.experimental.DB(region='us', cloud='aws')
# Keep biallelic rows only, then key by (locus, alleles) before annotating
mt = mt.filter_rows(hl.len(mt.alleles) == 2)
mt = mt.key_rows_by('locus', 'alleles')
mt = db.annotate_rows_db(mt, 'clinvar_variant_summary')

What’s the schema of the returned matrix table here?

mt.describe()

Hey @rodrigo.barreiro ,

Could you share the FTP/HTTP URL to the clinvar VCF file you used? The clinvar_variant_summary dataset must be based on a different file. Our dataset includes a staggering number of MNVs.

Hey @rodrigo.barreiro ,

I did a little digging on this. The latest ClinVar summary file has data like this (I’ve elided some columns):

#AlleleID   Type    Assembly    Chromosome  Start   Stop    ReferenceAllele AlternateAllele PositionVCF ReferenceAlleleVCF  AlternateAlleleVCF
"15041" "Indel" "GRCh37"    "7" "4820844"   "4820847"   "na"    "na"    "4820844"   "GGAT"  "TGCTGTAAACTGTAACTGTAAA"
"15041" "Indel" "GRCh38"    "7" "4781213"   "4781216"   "na"    "na"    "4781213"   "GGAT"  "TGCTGTAAACTGTAACTGTAAA"
"15042" "Deletion"  "GRCh37"    "7" "4827361"   "4827374"   "na"    "na"    "4827360"   "GCTGCTGGACCTGCC"   "G"
"15042" "Deletion"  "GRCh38"    "7" "4787730"   "4787743"   "na"    "na"    "4787729"   "GCTGCTGGACCTGCC"   "G"
"15043" "single nucleotide variant" "GRCh37"    "15"    "85342440"  "85342440"  "na"    "na"    "85342440"  "G" "A"

I believe our dataset used the Start and Stop columns to construct an interval.

Suppose your dataset had these variants:

7:4827360:GCTGCTGGACCTGCC:G
7:4827361:G:T
7:4827361:GCTGCTGGACCTGCC:G
7:4827363:T:A
15:85342440:G:A
15:85342440:G:T

Which annotations do you want to attach to which variants?
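
To make the question concrete, here is a minimal plain-Python sketch (no Hail involved) that compares two ways of matching the six variants above against the GRCh37 rows from the excerpt. It assumes the interval is closed on both ends, [Start, Stop], which I have not verified against how the dataset was actually built. The names SummaryRow, match_by_interval and match_by_alleles are made up for illustration and are not part of Hail.

```python
from typing import NamedTuple


class SummaryRow(NamedTuple):
    """One row of the ClinVar summary excerpt (hypothetical helper type)."""
    allele_id: str
    chrom: str
    start: int
    stop: int
    pos_vcf: int
    ref_vcf: str
    alt_vcf: str


# GRCh37 rows copied from the excerpt above.
SUMMARY = [
    SummaryRow("15042", "7", 4827361, 4827374, 4827360, "GCTGCTGGACCTGCC", "G"),
    SummaryRow("15043", "15", 85342440, 85342440, 85342440, "G", "A"),
]

# The six example variants: (chromosome, position, ref, alt).
VARIANTS = [
    ("7", 4827360, "GCTGCTGGACCTGCC", "G"),
    ("7", 4827361, "G", "T"),
    ("7", 4827361, "GCTGCTGGACCTGCC", "G"),
    ("7", 4827363, "T", "A"),
    ("15", 85342440, "G", "A"),
    ("15", 85342440, "G", "T"),
]


def match_by_interval(variant, rows):
    """Match on position only: the variant's position falls in [start, stop].

    Assumption: both ends of the interval are inclusive.
    """
    chrom, pos, _ref, _alt = variant
    return [r.allele_id for r in rows
            if r.chrom == chrom and r.start <= pos <= r.stop]


def match_by_alleles(variant, rows):
    """Match on the exact VCF-style key: chromosome, PositionVCF, ref, alt."""
    return [r.allele_id for r in rows
            if (r.chrom, r.pos_vcf, r.ref_vcf, r.alt_vcf) == variant]


if __name__ == "__main__":
    for v in VARIANTS:
        label = ":".join(str(field) for field in v)
        print(label,
              "interval:", match_by_interval(v, SUMMARY),
              "alleles:", match_by_alleles(v, SUMMARY))
```

Under these assumptions, the interval match attaches the 15042 deletion record to every variant positioned inside the deleted region, including unrelated SNVs such as 7:4827363:T:A, while missing the VCF-normalized form of the deletion itself at 7:4827360. The exact-allele match attaches each record only to the one variant that has the same position and alleles.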