How to programmatically determine whether a variant is an insertion, deletion, duplication, etc.?

Goal:
I am trying to compare variant types across two datasets and see which one is worth the subscription.

Description:
To do this, I want to know which database contains more variants for a particular gene, which types of variants are present in each database, and so on.
I have converted the VCF files from both databases to Hail matrix tables. Usually there is a field that gives the variant type. In my case, however, I don't have any annotation that says whether a particular variant is an insertion, deletion, indel, etc.

Possible solutions:

  • Annotate variants with open source databases for variant type
  • Use a programmatic approach to find the variant type

I found these tools, which might be of use for insertions/deletions:

  1. GitHub - tseemann/snippy: Rapid haploid variant calling and core genome alignment
  2. R script taking width into account: MutationalPatterns/get_indel_context.R at master · UMCUGenetics/MutationalPatterns · GitHub
  3. manta/README.md at master · Illumina/manta · GitHub

Any suggestions on how I can achieve this in Hail?

Hail has functions for this, like hl.allele_type. If you have biallelic variants, you can add this field with:

mt = mt.annotate_rows(allele_type = hl.allele_type(mt.alleles[0], mt.alleles[1]))
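If it helps to see what this kind of classification looks like, here is a rough plain-Python sketch that labels a (ref, alt) pair by trimming the shared prefix and suffix and comparing what is left. It is my own simplification, not Hail's implementation: `classify_allele` and `_trim` are hypothetical helper names, the label strings are only meant to resemble the ones `hl.allele_type` returns, and edge cases may be handled differently by Hail.

```python
from collections import Counter


def _trim(ref: str, alt: str) -> tuple[str, str]:
    """Strip the shared prefix, then the shared suffix, from both alleles."""
    i = 0
    while i < len(ref) and i < len(alt) and ref[i] == alt[i]:
        i += 1
    ref, alt = ref[i:], alt[i:]
    j = 0
    while j < len(ref) and j < len(alt) and ref[-1 - j] == alt[-1 - j]:
        j += 1
    return ref[: len(ref) - j], alt[: len(alt) - j]


def classify_allele(ref: str, alt: str) -> str:
    """Label a biallelic variant from its ref/alt strings (simplified rules)."""
    ref, alt = ref.upper(), alt.upper()
    if alt == "*":
        return "Star"
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "Symbolic"  # e.g. <DEL>, <DUP>, breakend notation
    bases = set("ACGTN")
    if not ref or not alt or not set(ref) <= bases or not set(alt) <= bases:
        return "Unknown"
    r, a = _trim(ref, alt)
    if not r and not a:
        return "Unknown"  # ref and alt are identical
    if not r:
        return "Insertion"  # only extra bases remain in alt
    if not a:
        return "Deletion"  # only extra bases remain in ref
    if len(r) == len(a):
        return "SNP" if len(r) == 1 else "MNP"
    return "Complex"  # bases both removed and added


# Tally types over a few made-up (ref, alt) pairs
pairs = [("A", "G"), ("A", "AT"), ("AT", "A"), ("AC", "GT"), ("AT", "GCC")]
counts = Counter(classify_allele(ref, alt) for ref, alt in pairs)
```

Note that a ref/alt comparison like this cannot tell a duplication from any other insertion; for that you generally need the flanking reference sequence. In Hail itself, multiallelic rows would need splitting first (for example with `hl.split_multi_hts`) before the two-allele call above applies, and something like `mt.aggregate_rows(hl.agg.counter(mt.allele_type))` should then give per-type counts for comparing the two datasets.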