Most severe VEP consequence

johnnyr · July 3, 2025, 2:52pm

I have annotated a VCF with VEP outside of hail and have used ‘annotate_rows’ in hail to get the desired annotation.
So far so good, but due to this, I cannot use e.g. gnomad.utils (or at least don’t know how) to easily annotate and filter for ‘most_severe_consequences’, to keep only one annotation per locus.
I know, this can be done while using VEP, but for some analyses, I’d like to keep everything.
Any suggestions how to achieve this.

Similarly, I have annotated a ‘score’ based on some conditions. Could I alternatively use this, to e.g. select the transcript with the lowest score per locus ?

ocus	alleles	Consequence	HGVSc
locus	array	str	str
chr1:156081600	[“C”,“T”]	“upstream_gene_variant”	“”
chr1:156081600	[“C”,“T”]	“non_coding_transcript_exon_variant”	“ENST00000442784.1:n.105G>A”
chr1:156081600	[“C”,“T”]	“upstream_gene_variant”	“”
chr1:156081600	[“C”,“T”]	“upstream_gene_variant”	“”
chr1:156081600	[“C”,“T”]	“upstream_gene_variant”	“”
chr1:156081600	[“C”,“T”]	“synonymous_variant”	“ENST00000532414.3:c.399G>A”
chr1:156081600	[“C”,“T”]	“upstream_gene_variant”	“”
chr1:156081600	[“C”,“T”]	“upstream_gene_variant”	“”
chr1:156081600	[“C”,“T”]	“upstream_gene_variant”	“”
chr1:156081600	[“C”,“T”]	“synonymous_variant”	“NM_001093725.2:c.399G>A”

patrick-schultz · July 3, 2025, 3:12pm

Hi @johnnyr, I’m not sure I understand what you want to achieve. Is the issue that there are multiple rows for the same locus and alleles? Can you share the annotate_rows code you used to add the annotations?

johnnyr · July 3, 2025, 3:46pm

Thanks for the quick response. Maybe I share it from the beginning. It might not be the most efficient or clean code, but for getting the data into hail it works fine:

1.) I make sure to get the right structure of the VEP annotation:

!cat myvcf | grep "##" | grep "CSQ" > myvcf.header.txt

So I am getting something like this in the “myvcf.header.txt”:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|MANE_SELECT|MANE_PLUS_CLINICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|UNIPROT_ISOFORM|REFSEQ_MATCH|SOURCE|REFSEQ_OFFSET|GENE_PHENO|SIFT|PolyPhen|DOMAINS|miRNA|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|gnomADe_AF|gnomADe_AFR_AF|gnomADe_AMR_AF|gnomADe_ASJ_AF|gnomADe_EAS_AF|gnomADe_FIN_AF|gnomADe_NFE_AF|gnomADe_OTH_AF|gnomADe_SAS_AF|gnomADg_AF|gnomADg_AFR_AF|gnomADg_AMI_AF|gnomADg_AMR_AF|gnomADg_ASJ_AF|gnomADg_EAS_AF|gnomADg_FIN_AF|gnomADg_MID_AF|gnomADg_NFE_AF|gnomADg_OTH_AF|gnomADg_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS|1000Gp3_AC|1000Gp3_AF|1000Gp3_AFR_AC|1000Gp3_AFR_AF|1000Gp3_AMR_AC|1000Gp3_AMR_AF|1000Gp3_EAS_AC|1000Gp3_EAS_AF|1000Gp3_EUR_AC|1000Gp3_EUR_AF|1000Gp3_SAS_AC|1000Gp3_SAS_AF|ALFA_African_AC|ALFA_African_AF|ALFA_African_AN|ALFA_African_American_AC|ALFA_African_American_AF|ALFA_African_American_AN|ALFA_African_Others_AC|ALFA_African_Others_AF|ALFA_African_Others_AN|ALFA_Asian_AC|ALFA_Asian_AF|ALFA_Asian_AN|ALFA_East_Asian_AC|ALFA_East_Asian_AF|ALFA_East_Asian_AN|ALFA_European_AC|ALFA_European_AF|ALFA_European_AN|ALFA_Latin_American_1_AC|ALFA_Latin_American_1_AF|ALFA_Latin_American_1_AN|ALFA_Latin_American_2_AC|ALFA_Latin_American_2_AF|ALFA_Latin_American_2_AN|ALFA_Other_AC|ALFA_Other_AF|ALFA_Other_AN|ALFA_Other_Asian_AC|ALFA_Other_Asian_AF|ALFA_Other_Asian_AN|ALFA_South_Asian_AC|ALFA_South_Asian_AF|ALFA_South_Asian_AN|ALFA_Total_AC|ALFA_Total_AF|ALFA_Total_AN|ALSPAC_AC|ALSPAC_AF|APPRIS|Aloft_Confidence|Aloft_Fraction_transcripts_affected|Aloft_pred|Aloft_prob_Dominant|Aloft_prob_Recessive|Aloft_prob_Tolerant|AlphaMissense_pred|AlphaMissense_rankscore|AlphaMissense_score|AltaiNeandertal|Ancestral_allele|BayesDel_addAF_pred|BayesDel_addAF_rankscore|BayesDel_addAF_score|BayesDel_noAF_pred|BayesDel_noAF_rankscore|BayesDel_noAF_score|CADD_phred|CADD_phred_hg19|CADD_raw|CADD_raw_hg19|CADD_raw_rankscore|CADD_raw_rankscore_hg19|ChagyrskayaNeandertal|ClinPred_pred|ClinPred_rankscore|ClinPred_score|DANN_rankscore|DANN_score|DEOGEN2_pred|DEOGEN2_rankscore|DEOGEN2_score|Denisova|ESM1b_pred|ESM1b_rankscore|ESM1b_score|ESP6500_AA_AC|ESP6500_AA_AF|ESP6500_EA_AC|ESP6500_EA_AF|EVE_Class10_pred|EVE_Class20_pred|EVE_Class25_pred|EVE_Class30_pred|EVE_Class40_pred|EVE_Class50_pred|EVE_Class60_pred|EVE_Class70_pred|EVE_Class75_pred|EVE_Class80_pred|EVE_Class90_pred|EVE_rankscore|EVE_score|Eigen-PC-phred_coding|Eigen-PC-raw_coding|Eigen-PC-raw_coding_rankscore|Eigen-phred_coding|Eigen-raw_coding|Eigen-raw_coding_rankscore|Ensembl_geneid|Ensembl_proteinid|Ensembl_transcriptid|ExAC_AC|ExAC_AF|ExAC_AFR_AC|ExAC_AFR_AF|ExAC_AMR_AC|ExAC_AMR_AF|ExAC_Adj_AC|ExAC_Adj_AF|ExAC_EAS_AC|ExAC_EAS_AF|ExAC_FIN_AC|ExAC_FIN_AF|ExAC_NFE_AC|ExAC_NFE_AF|ExAC_SAS_AC|ExAC_SAS_AF|ExAC_nonTCGA_AC|ExAC_nonTCGA_AF|ExAC_nonTCGA_AFR_AC|ExAC_nonTCGA_AFR_AF|ExAC_nonTCGA_AMR_AC|ExAC_nonTCGA_AMR_AF|ExAC_nonTCGA_Adj_AC|ExAC_nonTCGA_Adj_AF|ExAC_nonTCGA_EAS_AC|ExAC_nonTCGA_EAS_AF|ExAC_nonTCGA_FIN_AC|ExAC_nonTCGA_FIN_AF|ExAC_nonTCGA_NFE_AC|ExAC_nonTCGA_NFE_AF|ExAC_nonTCGA_SAS_AC|ExAC_nonTCGA_SAS_AF|ExAC_nonpsych_AC|ExAC_nonpsych_AF|ExAC_nonpsych_AFR_AC|ExAC_nonpsych_AFR_AF|ExAC_nonpsych_AMR_AC|ExAC_nonpsych_AMR_AF|ExAC_nonpsych_Adj_AC|ExAC_nonpsych_Adj_AF|ExAC_nonpsych_EAS_AC|ExAC_nonpsych_EAS_AF|ExAC_nonpsych_FIN_AC|ExAC_nonpsych_FIN_AF|ExAC_nonpsych_NFE_AC|ExAC_nonpsych_NFE_AF|ExAC_nonpsych_SAS_AC|ExAC_nonpsych_SAS_AF|FATHMM_converted_rankscore|FATHMM_pred|FATHMM_score|GENCODE_basic|GERP++_NR|GERP++_RS|GERP++_RS_rankscore|GM12878_confidence_value|GM12878_fitCons_rankscore|GM12878_fitCons_score|GTEx_V8_eQTL_gene|GTEx_V8_eQTL_tissue|GTEx_V8_sQTL_gene|GTEx_V8_sQTL_tissue|GenoCanyon_rankscore|GenoCanyon_score|Geuvadis_eQTL_target_gene|H1-hESC_confidence_value|H1-hESC_fitCons_rankscore|H1-hESC_fitCons_score|HGVSc_ANNOVAR|HGVSc_VEP|HGVSc_snpEff|HGVSp_ANNOVAR|HGVSp_VEP|HGVSp_snpEff|HUVEC_confidence_value|HUVEC_fitCons_rankscore|HUVEC_fitCons_score|Interpro_domain|LINSIGHT|LINSIGHT_rankscore|LIST-S2_pred|LIST-S2_rankscore|LIST-S2_score|LRT_Omega|LRT_converted_rankscore|LRT_pred|LRT_score|M-CAP_pred|M-CAP_rankscore|M-CAP_score|MPC_rankscore|MPC_score|MVP_rankscore|MVP_score|MetaLR_pred|MetaLR_rankscore|MetaLR_score|MetaRNN_pred|MetaRNN_rankscore|MetaRNN_score|MetaSVM_pred|MetaSVM_rankscore|MetaSVM_score|MutPred_AAchange|MutPred_Top5features|MutPred_protID|MutPred_rankscore|MutPred_score|MutationAssessor_pred|MutationAssessor_rankscore|MutationAssessor_score|MutationTaster_AAE|MutationTaster_converted_rankscore|MutationTaster_model|MutationTaster_pred|MutationTaster_score|PROVEAN_converted_rankscore|PROVEAN_pred|PROVEAN_score|Polyphen2_HDIV_pred|Polyphen2_HDIV_rankscore|Polyphen2_HDIV_score|Polyphen2_HVAR_pred|Polyphen2_HVAR_rankscore|Polyphen2_HVAR_score|PrimateAI_pred|PrimateAI_rankscore|PrimateAI_score|REVEL_rankscore|REVEL_score|Reliability_index|SIFT4G_converted_rankscore|SIFT4G_pred|SIFT4G_score|SIFT_converted_rankscore|SIFT_pred|SIFT_score|SiPhy_29way_logOdds|SiPhy_29way_logOdds_rankscore|SiPhy_29way_pi|TSL|TWINSUK_AC|TWINSUK_AF|UK10K_AC|UK10K_AF|Uniprot_acc|Uniprot_entry|VARITY_ER_LOO_rankscore|VARITY_ER_LOO_score|VARITY_ER_rankscore|VARITY_ER_score|VARITY_R_LOO_rankscore|VARITY_R_LOO_score|VARITY_R_rankscore|VARITY_R_score|VEP_canonical|VEST4_rankscore|VEST4_score|VindijiaNeandertal|aaalt|aapos|aaref|alt|bStatistic|bStatistic_converted_rankscore|cds_strand|chr|clinvar_MedGen_id|clinvar_OMIM_id|clinvar_Orphanet_id|clinvar_clnsig|clinvar_hgvs|clinvar_id|clinvar_review|clinvar_trait|clinvar_var_source|codon_degeneracy|codonpos|eQTLGen_cis_or_trans|eQTLGen_gene_id|eQTLGen_gene_symbol|eQTLGen_snp_id|fathmm-MKL_coding_group|fathmm-MKL_coding_pred|fathmm-MKL_coding_rankscore|fathmm-MKL_coding_score|fathmm-XF_coding_pred|fathmm-XF_coding_rankscore|fathmm-XF_coding_score|gMVP_rankscore|gMVP_score|genename|gnomAD_exomes_AC|gnomAD_exomes_AF|gnomAD_exomes_AFR_AC|gnomAD_exomes_AFR_AF|gnomAD_exomes_AFR_AN|gnomAD_exomes_AFR_nhomalt|gnomAD_exomes_AMR_AC|gnomAD_exomes_AMR_AF|gnomAD_exomes_AMR_AN|gnomAD_exomes_AMR_nhomalt|gnomAD_exomes_AN|gnomAD_exomes_ASJ_AC|gnomAD_exomes_ASJ_AF|gnomAD_exomes_ASJ_AN|gnomAD_exomes_ASJ_nhomalt|gnomAD_exomes_EAS_AC|gnomAD_exomes_EAS_AF|gnomAD_exomes_EAS_AN|gnomAD_exomes_EAS_nhomalt|gnomAD_exomes_FIN_AC|gnomAD_exomes_FIN_AF|gnomAD_exomes_FIN_AN|gnomAD_exomes_FIN_nhomalt|gnomAD_exomes_MID_AC|gnomAD_exomes_MID_AF|gnomAD_exomes_MID_AN|gnomAD_exomes_MID_nhomalt|gnomAD_exomes_NFE_AC|gnomAD_exomes_NFE_AF|gnomAD_exomes_NFE_AN|gnomAD_exomes_NFE_nhomalt|gnomAD_exomes_POPMAX_AC|gnomAD_exomes_POPMAX_AF|gnomAD_exomes_POPMAX_AN|gnomAD_exomes_POPMAX_nhomalt|gnomAD_exomes_SAS_AC|gnomAD_exomes_SAS_AF|gnomAD_exomes_SAS_AN|gnomAD_exomes_SAS_nhomalt|gnomAD_exomes_flag|gnomAD_exomes_nhomalt|gnomAD_exomes_non_ukb_AC|gnomAD_exomes_non_ukb_AF|gnomAD_exomes_non_ukb_AFR_AC|gnomAD_exomes_non_ukb_AFR_AF|gnomAD_exomes_non_ukb_AFR_AN|gnomAD_exomes_non_ukb_AFR_nhomalt|gnomAD_exomes_non_ukb_AMR_AC|gnomAD_exomes_non_ukb_AMR_AF|gnomAD_exomes_non_ukb_AMR_AN|gnomAD_exomes_non_ukb_AMR_nhomalt|gnomAD_exomes_non_ukb_AN|gnomAD_exomes_non_ukb_ASJ_AC|gnomAD_exomes_non_ukb_ASJ_AF|gnomAD_exomes_non_ukb_ASJ_AN|gnomAD_exomes_non_ukb_ASJ_nhomalt|gnomAD_exomes_non_ukb_EAS_AC|gnomAD_exomes_non_ukb_EAS_AF|gnomAD_exomes_non_ukb_EAS_AN|gnomAD_exomes_non_ukb_EAS_nhomalt|gnomAD_exomes_non_ukb_FIN_AC|gnomAD_exomes_non_ukb_FIN_AF|gnomAD_exomes_non_ukb_FIN_AN|gnomAD_exomes_non_ukb_FIN_nhomalt|gnomAD_exomes_non_ukb_MID_AC|gnomAD_exomes_non_ukb_MID_AF|gnomAD_exomes_non_ukb_MID_AN|gnomAD_exomes_non_ukb_MID_nhomalt|gnomAD_exomes_non_ukb_NFE_AC|gnomAD_exomes_non_ukb_NFE_AF|gnomAD_exomes_non_ukb_NFE_AN|gnomAD_exomes_non_ukb_NFE_nhomalt|gnomAD_exomes_non_ukb_SAS_AC|gnomAD_exomes_non_ukb_SAS_AF|gnomAD_exomes_non_ukb_SAS_AN|gnomAD_exomes_non_ukb_SAS_nhomalt|gnomAD_exomes_non_ukb_nhomalt|gnomAD_genomes_AC|gnomAD_genomes_AF|gnomAD_genomes_AFR_AC|gnomAD_genomes_AFR_AF|gnomAD_genomes_AFR_AN|gnomAD_genomes_AFR_nhomalt|gnomAD_genomes_AMI_AC|gnomAD_genomes_AMI_AF|gnomAD_genomes_AMI_AN|gnomAD_genomes_AMI_nhomalt|gnomAD_genomes_AMR_AC|gnomAD_genomes_AMR_AF|gnomAD_genomes_AMR_AN|gnomAD_genomes_AMR_nhomalt|gnomAD_genomes_AN|gnomAD_genomes_ASJ_AC|gnomAD_genomes_ASJ_AF|gnomAD_genomes_ASJ_AN|gnomAD_genomes_ASJ_nhomalt|gnomAD_genomes_EAS_AC|gnomAD_genomes_EAS_AF|gnomAD_genomes_EAS_AN|gnomAD_genomes_EAS_nhomalt|gnomAD_genomes_FIN_AC|gnomAD_genomes_FIN_AF|gnomAD_genomes_FIN_AN|gnomAD_genomes_FIN_nhomalt|gnomAD_genomes_MID_AC|gnomAD_genomes_MID_AF|gnomAD_genomes_MID_AN|gnomAD_genomes_MID_nhomalt|gnomAD_genomes_NFE_AC|gnomAD_genomes_NFE_AF|gnomAD_genomes_NFE_AN|gnomAD_genomes_NFE_nhomalt|gnomAD_genomes_POPMAX_AC|gnomAD_genomes_POPMAX_AF|gnomAD_genomes_POPMAX_AN|gnomAD_genomes_POPMAX_nhomalt|gnomAD_genomes_SAS_AC|gnomAD_genomes_SAS_AF|gnomAD_genomes_SAS_AN|gnomAD_genomes_SAS_nhomalt|gnomAD_genomes_flag|gnomAD_genomes_nhomalt|hg18_chr|hg18_pos(1-based)|hg19_chr|hg19_pos(1-based)|integrated_confidence_value|integrated_fitCons_rankscore|integrated_fitCons_score|phastCons100way_vertebrate|phastCons100way_vertebrate_rankscore|phastCons17way_primate|phastCons17way_primate_rankscore|phastCons470way_mammalian|phastCons470way_mammalian_rankscore|phyloP100way_vertebrate|phyloP100way_vertebrate_rankscore|phyloP17way_primate|phyloP17way_primate_rankscore|phyloP470way_mammalian|phyloP470way_mammalian_rankscore|pos(1-based)|ref|refcodon|rs_dbSNP|MES-NCSS_downstream_acceptor|MES-NCSS_downstream_donor|MES-NCSS_upstream_acceptor|MES-NCSS_upstream_donor|MES-SWA_acceptor_alt|MES-SWA_acceptor_diff|MES-SWA_acceptor_ref|MES-SWA_acceptor_ref_comp|MES-SWA_donor_alt|MES-SWA_donor_diff|MES-SWA_donor_ref|MES-SWA_donor_ref_comp|MaxEntScan_alt|MaxEntScan_diff|MaxEntScan_ref|ada_score|rf_score|SpliceAI_pred_DP_AG|SpliceAI_pred_DP_AL|SpliceAI_pred_DP_DG|SpliceAI_pred_DP_DL|SpliceAI_pred_DS_AG|SpliceAI_pred_DS_AL|SpliceAI_pred_DS_DG|SpliceAI_pred_DS_DL|SpliceAI_pred_SYMBOL|HGMD_2020|HGMD_2020_brief|HGMD_2020_pmid|HGMD_2020_disease|HGMD_2020_hgvs|HGMD_2020_confidence|HGMD_2020_variant_type|clinvar_20250120|clinvar_20250120_ALLELEID|clinvar_20250120_CLNSIG|clinvar_20250120_CLNDN|clinvar_20250120_CLNDNINCL|clinvar_20250120_CLNDISDB|clinvar_20250120_CLNDISDBINCL|clinvar_20250120_CLNHGVS|clinvar_20250120_CLNREVSTAT|clinvar_20250120_CLNSIGCONF|clinvar_20250120_CLNSIGINCL|clinvar_20250120_CLNVC|clinvar_20250120_CLNVCSO|clinvar_20250120_CLNVI|clinvar_20250120_DBVARID|clinvar_20250120_ORIGIN|clinvar_20250120_RS|clinvar_20250120_SSR">

2.) now my idea is to use the annotation names (and not array numbers) for the code in ‘annotate_rows’. I am getting an array of annotation names since I will split the CSQ by “|”.
This is the best, chatGPT came up with

# importing pandas
import pandas as pd

# read text file into pandas DataFrame
#data = pd.read_csv("/mnt/users/haas/projects/hail/mt/lhp.2024-07-03_vep_anno.txt")

# Text data to be read
data = """Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|MANE_SELECT|MANE_PLUS_CLINICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|UNIPROT_ISOFORM|REFSEQ_MATCH|SOURCE|REFSEQ_OFFSET|GENE_PHENO|SIFT|PolyPhen|DOMAINS|miRNA|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|gnomADe_AF|gnomADe_AFR_AF|gnomADe_AMR_AF|gnomADe_ASJ_AF|gnomADe_EAS_AF|gnomADe_FIN_AF|gnomADe_NFE_AF|gnomADe_OTH_AF|gnomADe_SAS_AF|gnomADg_AF|gnomADg_AFR_AF|gnomADg_AMI_AF|gnomADg_AMR_AF|gnomADg_ASJ_AF|gnomADg_EAS_AF|gnomADg_FIN_AF|gnomADg_MID_AF|gnomADg_NFE_AF|gnomADg_OTH_AF|gnomADg_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS|1000Gp3_AC|1000Gp3_AF|1000Gp3_AFR_AC|1000Gp3_AFR_AF|1000Gp3_AMR_AC|1000Gp3_AMR_AF|1000Gp3_EAS_AC|1000Gp3_EAS_AF|1000Gp3_EUR_AC|1000Gp3_EUR_AF|1000Gp3_SAS_AC|1000Gp3_SAS_AF|ALFA_African_AC|ALFA_African_AF|ALFA_African_AN|ALFA_African_American_AC|ALFA_African_American_AF|ALFA_African_American_AN|ALFA_African_Others_AC|ALFA_African_Others_AF|ALFA_African_Others_AN|ALFA_Asian_AC|ALFA_Asian_AF|ALFA_Asian_AN|ALFA_East_Asian_AC|ALFA_East_Asian_AF|ALFA_East_Asian_AN|ALFA_European_AC|ALFA_European_AF|ALFA_European_AN|ALFA_Latin_American_1_AC|ALFA_Latin_American_1_AF|ALFA_Latin_American_1_AN|ALFA_Latin_American_2_AC|ALFA_Latin_American_2_AF|ALFA_Latin_American_2_AN|ALFA_Other_AC|ALFA_Other_AF|ALFA_Other_AN|ALFA_Other_Asian_AC|ALFA_Other_Asian_AF|ALFA_Other_Asian_AN|ALFA_South_Asian_AC|ALFA_South_Asian_AF|ALFA_South_Asian_AN|ALFA_Total_AC|ALFA_Total_AF|ALFA_Total_AN|ALSPAC_AC|ALSPAC_AF|APPRIS|Aloft_Confidence|Aloft_Fraction_transcripts_affected|Aloft_pred|Aloft_prob_Dominant|Aloft_prob_Recessive|Aloft_prob_Tolerant|AlphaMissense_pred|AlphaMissense_rankscore|AlphaMissense_score|AltaiNeandertal|Ancestral_allele|BayesDel_addAF_pred|BayesDel_addAF_rankscore|BayesDel_addAF_score|BayesDel_noAF_pred|BayesDel_noAF_rankscore|BayesDel_noAF_score|CADD_phred|CADD_phred_hg19|CADD_raw|CADD_raw_hg19|CADD_raw_rankscore|CADD_raw_rankscore_hg19|ChagyrskayaNeandertal|ClinPred_pred|ClinPred_rankscore|ClinPred_score|DANN_rankscore|DANN_score|DEOGEN2_pred|DEOGEN2_rankscore|DEOGEN2_score|Denisova|ESM1b_pred|ESM1b_rankscore|ESM1b_score|ESP6500_AA_AC|ESP6500_AA_AF|ESP6500_EA_AC|ESP6500_EA_AF|EVE_Class10_pred|EVE_Class20_pred|EVE_Class25_pred|EVE_Class30_pred|EVE_Class40_pred|EVE_Class50_pred|EVE_Class60_pred|EVE_Class70_pred|EVE_Class75_pred|EVE_Class80_pred|EVE_Class90_pred|EVE_rankscore|EVE_score|Eigen-PC-phred_coding|Eigen-PC-raw_coding|Eigen-PC-raw_coding_rankscore|Eigen-phred_coding|Eigen-raw_coding|Eigen-raw_coding_rankscore|Ensembl_geneid|Ensembl_proteinid|Ensembl_transcriptid|ExAC_AC|ExAC_AF|ExAC_AFR_AC|ExAC_AFR_AF|ExAC_AMR_AC|ExAC_AMR_AF|ExAC_Adj_AC|ExAC_Adj_AF|ExAC_EAS_AC|ExAC_EAS_AF|ExAC_FIN_AC|ExAC_FIN_AF|ExAC_NFE_AC|ExAC_NFE_AF|ExAC_SAS_AC|ExAC_SAS_AF|ExAC_nonTCGA_AC|ExAC_nonTCGA_AF|ExAC_nonTCGA_AFR_AC|ExAC_nonTCGA_AFR_AF|ExAC_nonTCGA_AMR_AC|ExAC_nonTCGA_AMR_AF|ExAC_nonTCGA_Adj_AC|ExAC_nonTCGA_Adj_AF|ExAC_nonTCGA_EAS_AC|ExAC_nonTCGA_EAS_AF|ExAC_nonTCGA_FIN_AC|ExAC_nonTCGA_FIN_AF|ExAC_nonTCGA_NFE_AC|ExAC_nonTCGA_NFE_AF|ExAC_nonTCGA_SAS_AC|ExAC_nonTCGA_SAS_AF|ExAC_nonpsych_AC|ExAC_nonpsych_AF|ExAC_nonpsych_AFR_AC|ExAC_nonpsych_AFR_AF|ExAC_nonpsych_AMR_AC|ExAC_nonpsych_AMR_AF|ExAC_nonpsych_Adj_AC|ExAC_nonpsych_Adj_AF|ExAC_nonpsych_EAS_AC|ExAC_nonpsych_EAS_AF|ExAC_nonpsych_FIN_AC|ExAC_nonpsych_FIN_AF|ExAC_nonpsych_NFE_AC|ExAC_nonpsych_NFE_AF|ExAC_nonpsych_SAS_AC|ExAC_nonpsych_SAS_AF|FATHMM_converted_rankscore|FATHMM_pred|FATHMM_score|GENCODE_basic|GERP++_NR|GERP++_RS|GERP++_RS_rankscore|GM12878_confidence_value|GM12878_fitCons_rankscore|GM12878_fitCons_score|GTEx_V8_eQTL_gene|GTEx_V8_eQTL_tissue|GTEx_V8_sQTL_gene|GTEx_V8_sQTL_tissue|GenoCanyon_rankscore|GenoCanyon_score|Geuvadis_eQTL_target_gene|H1-hESC_confidence_value|H1-hESC_fitCons_rankscore|H1-hESC_fitCons_score|HGVSc_ANNOVAR|HGVSc_VEP|HGVSc_snpEff|HGVSp_ANNOVAR|HGVSp_VEP|HGVSp_snpEff|HUVEC_confidence_value|HUVEC_fitCons_rankscore|HUVEC_fitCons_score|Interpro_domain|LINSIGHT|LINSIGHT_rankscore|LIST-S2_pred|LIST-S2_rankscore|LIST-S2_score|LRT_Omega|LRT_converted_rankscore|LRT_pred|LRT_score|M-CAP_pred|M-CAP_rankscore|M-CAP_score|MPC_rankscore|MPC_score|MVP_rankscore|MVP_score|MetaLR_pred|MetaLR_rankscore|MetaLR_score|MetaRNN_pred|MetaRNN_rankscore|MetaRNN_score|MetaSVM_pred|MetaSVM_rankscore|MetaSVM_score|MutPred_AAchange|MutPred_Top5features|MutPred_protID|MutPred_rankscore|MutPred_score|MutationAssessor_pred|MutationAssessor_rankscore|MutationAssessor_score|MutationTaster_AAE|MutationTaster_converted_rankscore|MutationTaster_model|MutationTaster_pred|MutationTaster_score|PROVEAN_converted_rankscore|PROVEAN_pred|PROVEAN_score|Polyphen2_HDIV_pred|Polyphen2_HDIV_rankscore|Polyphen2_HDIV_score|Polyphen2_HVAR_pred|Polyphen2_HVAR_rankscore|Polyphen2_HVAR_score|PrimateAI_pred|PrimateAI_rankscore|PrimateAI_score|REVEL_rankscore|REVEL_score|Reliability_index|SIFT4G_converted_rankscore|SIFT4G_pred|SIFT4G_score|SIFT_converted_rankscore|SIFT_pred|SIFT_score|SiPhy_29way_logOdds|SiPhy_29way_logOdds_rankscore|SiPhy_29way_pi|TSL|TWINSUK_AC|TWINSUK_AF|UK10K_AC|UK10K_AF|Uniprot_acc|Uniprot_entry|VARITY_ER_LOO_rankscore|VARITY_ER_LOO_score|VARITY_ER_rankscore|VARITY_ER_score|VARITY_R_LOO_rankscore|VARITY_R_LOO_score|VARITY_R_rankscore|VARITY_R_score|VEP_canonical|VEST4_rankscore|VEST4_score|VindijiaNeandertal|aaalt|aapos|aaref|alt|bStatistic|bStatistic_converted_rankscore|cds_strand|chr|clinvar_MedGen_id|clinvar_OMIM_id|clinvar_Orphanet_id|clinvar_clnsig|clinvar_hgvs|clinvar_id|clinvar_review|clinvar_trait|clinvar_var_source|codon_degeneracy|codonpos|eQTLGen_cis_or_trans|eQTLGen_gene_id|eQTLGen_gene_symbol|eQTLGen_snp_id|fathmm-MKL_coding_group|fathmm-MKL_coding_pred|fathmm-MKL_coding_rankscore|fathmm-MKL_coding_score|fathmm-XF_coding_pred|fathmm-XF_coding_rankscore|fathmm-XF_coding_score|gMVP_rankscore|gMVP_score|genename|gnomAD_exomes_AC|gnomAD_exomes_AF|gnomAD_exomes_AFR_AC|gnomAD_exomes_AFR_AF|gnomAD_exomes_AFR_AN|gnomAD_exomes_AFR_nhomalt|gnomAD_exomes_AMR_AC|gnomAD_exomes_AMR_AF|gnomAD_exomes_AMR_AN|gnomAD_exomes_AMR_nhomalt|gnomAD_exomes_AN|gnomAD_exomes_ASJ_AC|gnomAD_exomes_ASJ_AF|gnomAD_exomes_ASJ_AN|gnomAD_exomes_ASJ_nhomalt|gnomAD_exomes_EAS_AC|gnomAD_exomes_EAS_AF|gnomAD_exomes_EAS_AN|gnomAD_exomes_EAS_nhomalt|gnomAD_exomes_FIN_AC|gnomAD_exomes_FIN_AF|gnomAD_exomes_FIN_AN|gnomAD_exomes_FIN_nhomalt|gnomAD_exomes_MID_AC|gnomAD_exomes_MID_AF|gnomAD_exomes_MID_AN|gnomAD_exomes_MID_nhomalt|gnomAD_exomes_NFE_AC|gnomAD_exomes_NFE_AF|gnomAD_exomes_NFE_AN|gnomAD_exomes_NFE_nhomalt|gnomAD_exomes_POPMAX_AC|gnomAD_exomes_POPMAX_AF|gnomAD_exomes_POPMAX_AN|gnomAD_exomes_POPMAX_nhomalt|gnomAD_exomes_SAS_AC|gnomAD_exomes_SAS_AF|gnomAD_exomes_SAS_AN|gnomAD_exomes_SAS_nhomalt|gnomAD_exomes_flag|gnomAD_exomes_nhomalt|gnomAD_exomes_non_ukb_AC|gnomAD_exomes_non_ukb_AF|gnomAD_exomes_non_ukb_AFR_AC|gnomAD_exomes_non_ukb_AFR_AF|gnomAD_exomes_non_ukb_AFR_AN|gnomAD_exomes_non_ukb_AFR_nhomalt|gnomAD_exomes_non_ukb_AMR_AC|gnomAD_exomes_non_ukb_AMR_AF|gnomAD_exomes_non_ukb_AMR_AN|gnomAD_exomes_non_ukb_AMR_nhomalt|gnomAD_exomes_non_ukb_AN|gnomAD_exomes_non_ukb_ASJ_AC|gnomAD_exomes_non_ukb_ASJ_AF|gnomAD_exomes_non_ukb_ASJ_AN|gnomAD_exomes_non_ukb_ASJ_nhomalt|gnomAD_exomes_non_ukb_EAS_AC|gnomAD_exomes_non_ukb_EAS_AF|gnomAD_exomes_non_ukb_EAS_AN|gnomAD_exomes_non_ukb_EAS_nhomalt|gnomAD_exomes_non_ukb_FIN_AC|gnomAD_exomes_non_ukb_FIN_AF|gnomAD_exomes_non_ukb_FIN_AN|gnomAD_exomes_non_ukb_FIN_nhomalt|gnomAD_exomes_non_ukb_MID_AC|gnomAD_exomes_non_ukb_MID_AF|gnomAD_exomes_non_ukb_MID_AN|gnomAD_exomes_non_ukb_MID_nhomalt|gnomAD_exomes_non_ukb_NFE_AC|gnomAD_exomes_non_ukb_NFE_AF|gnomAD_exomes_non_ukb_NFE_AN|gnomAD_exomes_non_ukb_NFE_nhomalt|gnomAD_exomes_non_ukb_SAS_AC|gnomAD_exomes_non_ukb_SAS_AF|gnomAD_exomes_non_ukb_SAS_AN|gnomAD_exomes_non_ukb_SAS_nhomalt|gnomAD_exomes_non_ukb_nhomalt|gnomAD_genomes_AC|gnomAD_genomes_AF|gnomAD_genomes_AFR_AC|gnomAD_genomes_AFR_AF|gnomAD_genomes_AFR_AN|gnomAD_genomes_AFR_nhomalt|gnomAD_genomes_AMI_AC|gnomAD_genomes_AMI_AF|gnomAD_genomes_AMI_AN|gnomAD_genomes_AMI_nhomalt|gnomAD_genomes_AMR_AC|gnomAD_genomes_AMR_AF|gnomAD_genomes_AMR_AN|gnomAD_genomes_AMR_nhomalt|gnomAD_genomes_AN|gnomAD_genomes_ASJ_AC|gnomAD_genomes_ASJ_AF|gnomAD_genomes_ASJ_AN|gnomAD_genomes_ASJ_nhomalt|gnomAD_genomes_EAS_AC|gnomAD_genomes_EAS_AF|gnomAD_genomes_EAS_AN|gnomAD_genomes_EAS_nhomalt|gnomAD_genomes_FIN_AC|gnomAD_genomes_FIN_AF|gnomAD_genomes_FIN_AN|gnomAD_genomes_FIN_nhomalt|gnomAD_genomes_MID_AC|gnomAD_genomes_MID_AF|gnomAD_genomes_MID_AN|gnomAD_genomes_MID_nhomalt|gnomAD_genomes_NFE_AC|gnomAD_genomes_NFE_AF|gnomAD_genomes_NFE_AN|gnomAD_genomes_NFE_nhomalt|gnomAD_genomes_POPMAX_AC|gnomAD_genomes_POPMAX_AF|gnomAD_genomes_POPMAX_AN|gnomAD_genomes_POPMAX_nhomalt|gnomAD_genomes_SAS_AC|gnomAD_genomes_SAS_AF|gnomAD_genomes_SAS_AN|gnomAD_genomes_SAS_nhomalt|gnomAD_genomes_flag|gnomAD_genomes_nhomalt|hg18_chr|hg18_pos(1-based)|hg19_chr|hg19_pos(1-based)|integrated_confidence_value|integrated_fitCons_rankscore|integrated_fitCons_score|phastCons100way_vertebrate|phastCons100way_vertebrate_rankscore|phastCons17way_primate|phastCons17way_primate_rankscore|phastCons470way_mammalian|phastCons470way_mammalian_rankscore|phyloP100way_vertebrate|phyloP100way_vertebrate_rankscore|phyloP17way_primate|phyloP17way_primate_rankscore|phyloP470way_mammalian|phyloP470way_mammalian_rankscore|pos(1-based)|ref|refcodon|rs_dbSNP|MES-NCSS_downstream_acceptor|MES-NCSS_downstream_donor|MES-NCSS_upstream_acceptor|MES-NCSS_upstream_donor|MES-SWA_acceptor_alt|MES-SWA_acceptor_diff|MES-SWA_acceptor_ref|MES-SWA_acceptor_ref_comp|MES-SWA_donor_alt|MES-SWA_donor_diff|MES-SWA_donor_ref|MES-SWA_donor_ref_comp|MaxEntScan_alt|MaxEntScan_diff|MaxEntScan_ref|ada_score|rf_score|SpliceAI_pred_DP_AG|SpliceAI_pred_DP_AL|SpliceAI_pred_DP_DG|SpliceAI_pred_DP_DL|SpliceAI_pred_DS_AG|SpliceAI_pred_DS_AL|SpliceAI_pred_DS_DG|SpliceAI_pred_DS_DL|SpliceAI_pred_SYMBOL|HGMD_2020|HGMD_2020_brief|HGMD_2020_pmid|HGMD_2020_disease|HGMD_2020_hgvs|HGMD_2020_confidence|HGMD_2020_variant_type|clinvar_20250120|clinvar_20250120_ALLELEID|clinvar_20250120_CLNSIG|clinvar_20250120_CLNDN|clinvar_20250120_CLNDNINCL|clinvar_20250120_CLNDISDB|clinvar_20250120_CLNDISDBINCL|clinvar_20250120_CLNHGVS|clinvar_20250120_CLNREVSTAT|clinvar_20250120_CLNSIGCONF|clinvar_20250120_CLNSIGINCL|clinvar_20250120_CLNVC|clinvar_20250120_CLNVCSO|clinvar_20250120_CLNVI|clinvar_20250120_DBVARID|clinvar_20250120_ORIGIN|clinvar_20250120_RS|clinvar_20250120_SSR">"""

# Split the text by "|" and create a list of rows
rows = [line.split("|") for line in data.split("\n")]

# Create a DataFrame from the rows
df = pd.DataFrame(rows)

# Add a second column with row numbers
df["RowNumber"] = df.index + 1


# Transpose the DataFrame using the T attribute
transposed_df = df.T

# Reset the index to move row names into a regular column
df_reset = transposed_df.reset_index()

# Rename the column containing the row names to "name"
df_reset = df_reset.rename(columns={'index': 'index_count'})

# Rename the second column to "VEP"
df_reset = df_reset.rename(columns={df_reset.columns[1]: "VEP"})

# Using + operator to combine two columns into the column "annotations"
df_reset["annotations"] = df_reset['VEP'].astype(str) +" = "+ df_reset["index_count"].astype(str)

# Filter rows based on values in the "VEP" column using the vector of annotations
annotations = ['IMPACT', 'Consequence','SYMBOL','Gene','CANONICAL','MANE_SELECT','MANE_PLUS_CLINICAL','HGVSc','HGVSp','CADD_phred','CADD_raw','CADD_raw_rankscore',
           'gnomADe_AF','gnomADe_NFE_AF','gnomADg_AF','gnomADg_NFE_AF',
           'SpliceAI_pred_DP_AG','SpliceAI_pred_DP_AL','SpliceAI_pred_DP_DG','SpliceAI_pred_DP_DL','SpliceAI_pred_DS_AG','SpliceAI_pred_DS_AL','SpliceAI_pred_DS_DG','SpliceAI_pred_DS_DL','SpliceAI_pred_SYMBOL','AlphaMissense_pred','AlphaMissense_rankscore','AlphaMissense_score','REVEL_rankscore','REVEL_score','Clinvar_CLNSIG','Clinvar_CLNSIGCONF','clinvar_clnsig','REVEL_rankscore','REVEL_score','Reliability_index','MaxEntScan_alt','MaxEntScan_diff','MaxEntScan_ref']

filtered_df = df_reset[df_reset['VEP'].isin(annotations)]

# Get the values of the 'annotations' column and print without index numbers
for value in filtered_df['annotations']:
    print(value)

3.) now I can define variables of interest:

Consequence = 1
IMPACT = 2
SYMBOL = 3
Gene = 4
HGVSc = 10
HGVSp = 11

and then split the CSQ info from VEP:

split_csq = mt.info.CSQ
split_csq = split_csq.map(lambda x: x.split('\\|'))

mt = mt.annotate_rows(split_csq = split_csq)

mt = mt.explode_rows(mt.split_csq)

4.) and finally do the annotate_rows:

mt = mt.annotate_rows(
    IMPACT = mt.split_csq[IMPACT],
    Consequence = mt.split_csq[Consequence],
    SYMBOL = mt.split_csq[SYMBOL],
    Gene = mt.split_csq[Gene],
    HGVSc = mt.split_csq[HGVSc],
    HGVSp = mt.split_csq[HGVSp]
)

Any hint on also improving this part is highly appreciated.

patrick-schultz · July 3, 2025, 5:03pm

Thanks, I think I’m mostly following now. But I’m still unclear on what the end goal is. You say

I’m not sure what format the gnomad utils expect. Could you elaborate on how you want the data to look, maybe using the example from your first post?

I’m also a bit confused by the last steps in your last post. You explode on mt.split_csq, so after that split_csq will be a single string, not an array any more. But then in step 4 you are indexing into split_csq. I think maybe you don’t want to explode, but I can be of more help if you explain what you want in the end.

johnnyr · July 4, 2025, 2:21pm

Main goal is to reduce the table from the first post to only one row based on certain conditions, so that variants are unique. Probably creating a new variantID in the format chr:pos:ref:alt would be best to be specific. Then I would need a condition that selects the most severe consequence for this variantID based on a certain order e.g. non_coding_transcript_exon_variant would be first choice, synonymous_variant second and upstream_gene_variant third. Still this could leed to multiple transcripts with the same consequence. In this case I have another row-field ‘score’ which I could use for prioritization. Could that somehow work with ‘group_by’ (variant ID)?

patrick-schultz · July 8, 2025, 2:10pm

There are still a couple of things I don’t understand from your post with the code snippets. In 3), what is mt? Can you share the code that defines it? Is it just importing a VCF? Or is it annotating an existing matrix table with an annotation in a VCF? I’m trying to understand how you came to have multiple rows with the same locus/alleles in the first place.

And second, did you actually run 4)? It looks to me like that would error, since after mt = mt.explode_rows(mt.split_csq), the mt.split_csq field is no longer an array, so indexing into it like mt.split_csq[IMPACT] would be an error. Given that, I can’t tell what those steps are meant to accomplish, or if this is the cause of the duplicated variants.

Topic		Replies	Views
How to parse CSQ (VEP) field inside Hail 0.2? Hail Query & hailctl	17	2015	February 14, 2023
Best strategy for annotating and filtering VCF files using HAIL-VEP on UKB RAP? Hail Query & hailctl	3	1535	June 17, 2022
FYI: VEP annotation inconsistencies gnomad.utils Science	0	680	April 28, 2020
Filtering a MatrixTable with VEP annotations Hail Query & hailctl	0	347	August 10, 2020
Parsing VEP output Help [0.1]	5	2169	September 22, 2016

Most severe VEP consequence

Related topics