Annotation types in annotate_rows_db() with ClinVar

rodrigo.barreiro · July 18, 2022, 5:48pm

Hello Hail Team,

I was trying to find carriers of pathogenic variants described in ClinVar, but when using annotate_rows_db() I couldn’t get any variant-level annotation. After the annotation, these were the types of annotation I got: ‘deletion’, ‘indel’, ‘copy number gain’, ‘copy number loss’, ‘duplication’ and ‘insertion’.

I know that there are pathogenic variants in this cohort; I confirmed it by semi-joining the cohort with ClinVar’s VCF file.

Why are SNV-level variants not annotated? Is this the expected behavior, or am I doing something wrong?

I’ve added the code I’m using below.

Thank you for the support,
Rodrigo Barreiro

–

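import hail as hl
hl.init()

mt = hl.read_matrix_table("s3://../my_cohort.mt")

# Hail Annotation Database handle for the US region on AWS
db = hl.experimental.DB(region='us', cloud='aws')
# Keep only biallelic sites
mt = mt.filter_rows(hl.len(mt.alleles) == 2)
# Key rows by (locus, alleles) before annotating
mt = mt.key_rows_by('locus', 'alleles')
mt = db.annotate_rows_db(mt, 'clinvar_variant_summary')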

tpoterba · July 22, 2022, 1:30pm

What’s the schema of the returned matrix table here?

mt.describe()

danking · July 22, 2022, 6:01pm

Hey @rodrigo.barreiro ,

Could you share the FTP/HTTP URL of the ClinVar VCF file you used? The clinvar_variant_summary dataset must be based on a different file. Our dataset includes a staggering number of MNVs.

danking · July 22, 2022, 8:12pm

Hey @rodrigo.barreiro ,

I did a little digging on this. The latest ClinVar summary file has data like this (I’ve elided some columns):

#AlleleID	Type	Assembly	Chromosome	Start	Stop	ReferenceAllele	AlternateAllele	PositionVCF	ReferenceAlleleVCF	AlternateAlleleVCF
"15041"	"Indel"	"GRCh37"	"7"	"4820844"	"4820847"	"na"	"na"	"4820844"	"GGAT"	"TGCTGTAAACTGTAACTGTAAA"
"15041"	"Indel"	"GRCh38"	"7"	"4781213"	"4781216"	"na"	"na"	"4781213"	"GGAT"	"TGCTGTAAACTGTAACTGTAAA"
"15042"	"Deletion"	"GRCh37"	"7"	"4827361"	"4827374"	"na"	"na"	"4827360"	"GCTGCTGGACCTGCC"	"G"
"15042"	"Deletion"	"GRCh38"	"7"	"4787730"	"4787743"	"na"	"na"	"4787729"	"GCTGCTGGACCTGCC"	"G"
"15043"	"single nucleotide variant"	"GRCh37"	"15"	"85342440"	"85342440"	"na"	"na"	"85342440"	"G"	"A"

I believe our dataset used the start and stop to construct an interval.
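
To make the two possible keys concrete, here is a minimal plain-Python sketch (no Hail involved) that builds both a Start/Stop interval and a VCF-style variant key from the GRCh37 rows in the excerpt. The helper names `interval_key` and `variant_key` are hypothetical and exist only for illustration; this is not how the Hail dataset is necessarily built.

```python
# Column names and GRCh37 rows copied from the summary-file excerpt.
HEADER = [
    "AlleleID", "Type", "Assembly", "Chromosome", "Start", "Stop",
    "ReferenceAllele", "AlternateAllele",
    "PositionVCF", "ReferenceAlleleVCF", "AlternateAlleleVCF",
]
ROWS = [
    ["15041", "Indel", "GRCh37", "7", "4820844", "4820847", "na", "na",
     "4820844", "GGAT", "TGCTGTAAACTGTAACTGTAAA"],
    ["15042", "Deletion", "GRCh37", "7", "4827361", "4827374", "na", "na",
     "4827360", "GCTGCTGGACCTGCC", "G"],
    ["15043", "single nucleotide variant", "GRCh37", "15", "85342440",
     "85342440", "na", "na", "85342440", "G", "A"],
]

records = [dict(zip(HEADER, row)) for row in ROWS]


def interval_key(rec):
    """Hypothetical helper: (chromosome, start, stop) from Start/Stop."""
    return (rec["Chromosome"], int(rec["Start"]), int(rec["Stop"]))


def variant_key(rec):
    """Hypothetical helper: chrom:pos:ref:alt from the *VCF columns."""
    return "{}:{}:{}:{}".format(
        rec["Chromosome"],
        rec["PositionVCF"],
        rec["ReferenceAlleleVCF"],
        rec["AlternateAlleleVCF"],
    )


for rec in records:
    print(rec["AlleleID"], rec["Type"], interval_key(rec), variant_key(rec))
```

For the deletion, the interval starts at 4827361 while the VCF-style position is 4827360, because the VCF representation includes the preceding anchor base.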

Suppose your dataset had these variants:

7:4827360:GCTGCTGGACCTGCC:G
7:4827361:G:T
7:4827361:GCTGCTGGACCTGCC:G
7:4827363:T:A
15:85342440:G:A
15:85342440:G:T

Which annotations do you want to attach to which variants?
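
The question can be illustrated with a small plain-Python sketch that compares two matching rules on the six variants listed here: lookup by interval (the variant’s position falls inside Start..Stop, inclusive) versus exact lookup on chrom:pos:ref:alt. Both rules, and the names `match_by_interval` and `match_exact`, are assumptions for illustration only; they are not taken from Hail’s implementation.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


class Interval(NamedTuple):
    chrom: str
    start: int  # inclusive
    stop: int   # inclusive
    label: str


def parse_variant(text):
    """Parse a 'chrom:pos:ref:alt' string into a Variant."""
    chrom, pos, ref, alt = text.split(":")
    return Variant(chrom, int(pos), ref, alt)


# GRCh37 records 15042 and 15043 from the summary-file excerpt.
INTERVALS = [
    Interval("7", 4827361, 4827374, "15042 Deletion"),
    Interval("15", 85342440, 85342440, "15043 single nucleotide variant"),
]
EXACT = {
    parse_variant("7:4827360:GCTGCTGGACCTGCC:G"): "15042 Deletion",
    parse_variant("15:85342440:G:A"): "15043 single nucleotide variant",
}


def match_by_interval(variant):
    """Assumed rule: annotate if the variant position lies in the interval."""
    return [
        iv.label
        for iv in INTERVALS
        if iv.chrom == variant.chrom and iv.start <= variant.pos <= iv.stop
    ]


def match_exact(variant):
    """Assumed rule: annotate only on an identical chrom:pos:ref:alt."""
    return EXACT.get(variant)


COHORT = [
    "7:4827360:GCTGCTGGACCTGCC:G",
    "7:4827361:G:T",
    "7:4827361:GCTGCTGGACCTGCC:G",
    "7:4827363:T:A",
    "15:85342440:G:A",
    "15:85342440:G:T",
]

for text in COHORT:
    v = parse_variant(text)
    print(text, "| interval:", match_by_interval(v), "| exact:", match_exact(v))
```

Under these assumptions, the interval rule attaches the SNV record to both 15:85342440:G:A and 15:85342440:G:T and misses 7:4827360:GCTGCTGGACCTGCC:G, whereas the exact rule annotates only 7:4827360:GCTGCTGGACCTGCC:G and 15:85342440:G:A.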
