Most severe VEP consequence

I have annotated a VCF with VEP outside of hail and have used ‘annotate_rows’ in hail to get the desired annotation.
So far so good, but due to this, I cannot use e.g. gnomad.utils (or at least don’t know how) to easily annotate and filter for ‘most_severe_consequences’, to keep only one annotation per locus.
I know, this can be done while using VEP, but for some analyses, I’d like to keep everything.
Any suggestions how to achieve this.

Similarly, I have annotated a ‘score’ based on some conditions. Could I alternatively use this, to e.g. select the transcript with the lowest score per locus ?

ocus alleles Consequence HGVSc
locus array str str
chr1:156081600 [“C”,“T”] “upstream_gene_variant” “”
chr1:156081600 [“C”,“T”] “non_coding_transcript_exon_variant” “ENST00000442784.1:n.105G>A”
chr1:156081600 [“C”,“T”] “upstream_gene_variant” “”
chr1:156081600 [“C”,“T”] “upstream_gene_variant” “”
chr1:156081600 [“C”,“T”] “upstream_gene_variant” “”
chr1:156081600 [“C”,“T”] “synonymous_variant” “ENST00000532414.3:c.399G>A”
chr1:156081600 [“C”,“T”] “upstream_gene_variant” “”
chr1:156081600 [“C”,“T”] “upstream_gene_variant” “”
chr1:156081600 [“C”,“T”] “upstream_gene_variant” “”
chr1:156081600 [“C”,“T”] “synonymous_variant” “NM_001093725.2:c.399G>A”

Hi @johnnyr, I’m not sure I understand what you want to achieve. Is the issue that there are multiple rows for the same locus and alleles? Can you share the annotate_rows code you used to add the annotations?

Thanks for the quick response. Maybe I share it from the beginning. It might not be the most efficient or clean code, but for getting the data into hail it works fine:

1.) I make sure to get the right structure of the VEP annotation:

!cat myvcf | grep "##" | grep "CSQ" > myvcf.header.txt

So I am getting something like this in the “myvcf.header.txt”:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|MANE_SELECT|MANE_PLUS_CLINICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|UNIPROT_ISOFORM|REFSEQ_MATCH|SOURCE|REFSEQ_OFFSET|GENE_PHENO|SIFT|PolyPhen|DOMAINS|miRNA|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|gnomADe_AF|gnomADe_AFR_AF|gnomADe_AMR_AF|gnomADe_ASJ_AF|gnomADe_EAS_AF|gnomADe_FIN_AF|gnomADe_NFE_AF|gnomADe_OTH_AF|gnomADe_SAS_AF|gnomADg_AF|gnomADg_AFR_AF|gnomADg_AMI_AF|gnomADg_AMR_AF|gnomADg_ASJ_AF|gnomADg_EAS_AF|gnomADg_FIN_AF|gnomADg_MID_AF|gnomADg_NFE_AF|gnomADg_OTH_AF|gnomADg_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS|1000Gp3_AC|1000Gp3_AF|1000Gp3_AFR_AC|1000Gp3_AFR_AF|1000Gp3_AMR_AC|1000Gp3_AMR_AF|1000Gp3_EAS_AC|1000Gp3_EAS_AF|1000Gp3_EUR_AC|1000Gp3_EUR_AF|1000Gp3_SAS_AC|1000Gp3_SAS_AF|ALFA_African_AC|ALFA_African_AF|ALFA_African_AN|ALFA_African_American_AC|ALFA_African_American_AF|ALFA_African_American_AN|ALFA_African_Others_AC|ALFA_African_Others_AF|ALFA_African_Others_AN|ALFA_Asian_AC|ALFA_Asian_AF|ALFA_Asian_AN|ALFA_East_Asian_AC|ALFA_East_Asian_AF|ALFA_East_Asian_AN|ALFA_European_AC|ALFA_European_AF|ALFA_European_AN|ALFA_Latin_American_1_AC|ALFA_Latin_American_1_AF|ALFA_Latin_American_1_AN|ALFA_Latin_American_2_AC|ALFA_Latin_American_2_AF|ALFA_Latin_American_2_AN|ALFA_Other_AC|ALFA_Other_AF|ALFA_Other_AN|ALFA_Other_Asian_AC|ALFA_Other_Asian_AF|ALFA_Other_Asian_AN|ALFA_South_Asian_AC|ALFA_South_Asian_AF|ALFA_South_Asian_AN|ALFA_Total_AC|ALFA_Total_AF|ALFA_Total_AN|ALSPAC_AC|ALSPAC_AF|APPRIS|Aloft_Confidence|Aloft_Fraction_transcripts_affected|Aloft_pred|Aloft_prob_Dominant|Aloft_prob_Recessive|Aloft_prob_Tolerant|AlphaMissense_pred|AlphaMissense_rankscore|AlphaMissense_score|AltaiNeandertal|Ancestral_allele|BayesDel_addAF_pred|BayesDel_addAF_rankscore|BayesDel_addAF_score|BayesDel_noAF_pred|BayesDel_noAF_rankscore|BayesDel_noAF_score|CADD_phred|CADD_phred_hg19|CADD_raw|CADD_raw_hg19|CADD_raw_rankscore|CADD_raw_rankscore_hg19|ChagyrskayaNeandertal|ClinPred_pred|ClinPred_rankscore|ClinPred_score|DANN_rankscore|DANN_score|DEOGEN2_pred|DEOGEN2_rankscore|DEOGEN2_score|Denisova|ESM1b_pred|ESM1b_rankscore|ESM1b_score|ESP6500_AA_AC|ESP6500_AA_AF|ESP6500_EA_AC|ESP6500_EA_AF|EVE_Class10_pred|EVE_Class20_pred|EVE_Class25_pred|EVE_Class30_pred|EVE_Class40_pred|EVE_Class50_pred|EVE_Class60_pred|EVE_Class70_pred|EVE_Class75_pred|EVE_Class80_pred|EVE_Class90_pred|EVE_rankscore|EVE_score|Eigen-PC-phred_coding|Eigen-PC-raw_coding|Eigen-PC-raw_coding_rankscore|Eigen-phred_coding|Eigen-raw_coding|Eigen-raw_coding_rankscore|Ensembl_geneid|Ensembl_proteinid|Ensembl_transcriptid|ExAC_AC|ExAC_AF|ExAC_AFR_AC|ExAC_AFR_AF|ExAC_AMR_AC|ExAC_AMR_AF|ExAC_Adj_AC|ExAC_Adj_AF|ExAC_EAS_AC|ExAC_EAS_AF|ExAC_FIN_AC|ExAC_FIN_AF|ExAC_NFE_AC|ExAC_NFE_AF|ExAC_SAS_AC|ExAC_SAS_AF|ExAC_nonTCGA_AC|ExAC_nonTCGA_AF|ExAC_nonTCGA_AFR_AC|ExAC_nonTCGA_AFR_AF|ExAC_nonTCGA_AMR_AC|ExAC_nonTCGA_AMR_AF|ExAC_nonTCGA_Adj_AC|ExAC_nonTCGA_Adj_AF|ExAC_nonTCGA_EAS_AC|ExAC_nonTCGA_EAS_AF|ExAC_nonTCGA_FIN_AC|ExAC_nonTCGA_FIN_AF|ExAC_nonTCGA_NFE_AC|ExAC_nonTCGA_NFE_AF|ExAC_nonTCGA_SAS_AC|ExAC_nonTCGA_SAS_AF|ExAC_nonpsych_AC|ExAC_nonpsych_AF|ExAC_nonpsych_AFR_AC|ExAC_nonpsych_AFR_AF|ExAC_nonpsych_AMR_AC|ExAC_nonpsych_AMR_AF|ExAC_nonpsych_Adj_AC|ExAC_nonpsych_Adj_AF|ExAC_nonpsych_EAS_AC|ExAC_nonpsych_EAS_AF|ExAC_nonpsych_FIN_AC|ExAC_nonpsych_FIN_AF|ExAC_nonpsych_NFE_AC|ExAC_nonpsych_NFE_AF|ExAC_nonpsych_SAS_AC|ExAC_nonpsych_SAS_AF|FATHMM_converted_rankscore|FATHMM_pred|FATHMM_score|GENCODE_basic|GERP++_NR|GERP++_RS|GERP++_RS_rankscore|GM12878_confidence_value|GM12878_fitCons_rankscore|GM12878_fitCons_score|GTEx_V8_eQTL_gene|GTEx_V8_eQTL_tissue|GTEx_V8_sQTL_gene|GTEx_V8_sQTL_tissue|GenoCanyon_rankscore|GenoCanyon_score|Geuvadis_eQTL_target_gene|H1-hESC_confidence_value|H1-hESC_fitCons_rankscore|H1-hESC_fitCons_score|HGVSc_ANNOVAR|HGVSc_VEP|HGVSc_snpEff|HGVSp_ANNOVAR|HGVSp_VEP|HGVSp_snpEff|HUVEC_confidence_value|HUVEC_fitCons_rankscore|HUVEC_fitCons_score|Interpro_domain|LINSIGHT|LINSIGHT_rankscore|LIST-S2_pred|LIST-S2_rankscore|LIST-S2_score|LRT_Omega|LRT_converted_rankscore|LRT_pred|LRT_score|M-CAP_pred|M-CAP_rankscore|M-CAP_score|MPC_rankscore|MPC_score|MVP_rankscore|MVP_score|MetaLR_pred|MetaLR_rankscore|MetaLR_score|MetaRNN_pred|MetaRNN_rankscore|MetaRNN_score|MetaSVM_pred|MetaSVM_rankscore|MetaSVM_score|MutPred_AAchange|MutPred_Top5features|MutPred_protID|MutPred_rankscore|MutPred_score|MutationAssessor_pred|MutationAssessor_rankscore|MutationAssessor_score|MutationTaster_AAE|MutationTaster_converted_rankscore|MutationTaster_model|MutationTaster_pred|MutationTaster_score|PROVEAN_converted_rankscore|PROVEAN_pred|PROVEAN_score|Polyphen2_HDIV_pred|Polyphen2_HDIV_rankscore|Polyphen2_HDIV_score|Polyphen2_HVAR_pred|Polyphen2_HVAR_rankscore|Polyphen2_HVAR_score|PrimateAI_pred|PrimateAI_rankscore|PrimateAI_score|REVEL_rankscore|REVEL_score|Reliability_index|SIFT4G_converted_rankscore|SIFT4G_pred|SIFT4G_score|SIFT_converted_rankscore|SIFT_pred|SIFT_score|SiPhy_29way_logOdds|SiPhy_29way_logOdds_rankscore|SiPhy_29way_pi|TSL|TWINSUK_AC|TWINSUK_AF|UK10K_AC|UK10K_AF|Uniprot_acc|Uniprot_entry|VARITY_ER_LOO_rankscore|VARITY_ER_LOO_score|VARITY_ER_rankscore|VARITY_ER_score|VARITY_R_LOO_rankscore|VARITY_R_LOO_score|VARITY_R_rankscore|VARITY_R_score|VEP_canonical|VEST4_rankscore|VEST4_score|VindijiaNeandertal|aaalt|aapos|aaref|alt|bStatistic|bStatistic_converted_rankscore|cds_strand|chr|clinvar_MedGen_id|clinvar_OMIM_id|clinvar_Orphanet_id|clinvar_clnsig|clinvar_hgvs|clinvar_id|clinvar_review|clinvar_trait|clinvar_var_source|codon_degeneracy|codonpos|eQTLGen_cis_or_trans|eQTLGen_gene_id|eQTLGen_gene_symbol|eQTLGen_snp_id|fathmm-MKL_coding_group|fathmm-MKL_coding_pred|fathmm-MKL_coding_rankscore|fathmm-MKL_coding_score|fathmm-XF_coding_pred|fathmm-XF_coding_rankscore|fathmm-XF_coding_score|gMVP_rankscore|gMVP_score|genename|gnomAD_exomes_AC|gnomAD_exomes_AF|gnomAD_exomes_AFR_AC|gnomAD_exomes_AFR_AF|gnomAD_exomes_AFR_AN|gnomAD_exomes_AFR_nhomalt|gnomAD_exomes_AMR_AC|gnomAD_exomes_AMR_AF|gnomAD_exomes_AMR_AN|gnomAD_exomes_AMR_nhomalt|gnomAD_exomes_AN|gnomAD_exomes_ASJ_AC|gnomAD_exomes_ASJ_AF|gnomAD_exomes_ASJ_AN|gnomAD_exomes_ASJ_nhomalt|gnomAD_exomes_EAS_AC|gnomAD_exomes_EAS_AF|gnomAD_exomes_EAS_AN|gnomAD_exomes_EAS_nhomalt|gnomAD_exomes_FIN_AC|gnomAD_exomes_FIN_AF|gnomAD_exomes_FIN_AN|gnomAD_exomes_FIN_nhomalt|gnomAD_exomes_MID_AC|gnomAD_exomes_MID_AF|gnomAD_exomes_MID_AN|gnomAD_exomes_MID_nhomalt|gnomAD_exomes_NFE_AC|gnomAD_exomes_NFE_AF|gnomAD_exomes_NFE_AN|gnomAD_exomes_NFE_nhomalt|gnomAD_exomes_POPMAX_AC|gnomAD_exomes_POPMAX_AF|gnomAD_exomes_POPMAX_AN|gnomAD_exomes_POPMAX_nhomalt|gnomAD_exomes_SAS_AC|gnomAD_exomes_SAS_AF|gnomAD_exomes_SAS_AN|gnomAD_exomes_SAS_nhomalt|gnomAD_exomes_flag|gnomAD_exomes_nhomalt|gnomAD_exomes_non_ukb_AC|gnomAD_exomes_non_ukb_AF|gnomAD_exomes_non_ukb_AFR_AC|gnomAD_exomes_non_ukb_AFR_AF|gnomAD_exomes_non_ukb_AFR_AN|gnomAD_exomes_non_ukb_AFR_nhomalt|gnomAD_exomes_non_ukb_AMR_AC|gnomAD_exomes_non_ukb_AMR_AF|gnomAD_exomes_non_ukb_AMR_AN|gnomAD_exomes_non_ukb_AMR_nhomalt|gnomAD_exomes_non_ukb_AN|gnomAD_exomes_non_ukb_ASJ_AC|gnomAD_exomes_non_ukb_ASJ_AF|gnomAD_exomes_non_ukb_ASJ_AN|gnomAD_exomes_non_ukb_ASJ_nhomalt|gnomAD_exomes_non_ukb_EAS_AC|gnomAD_exomes_non_ukb_EAS_AF|gnomAD_exomes_non_ukb_EAS_AN|gnomAD_exomes_non_ukb_EAS_nhomalt|gnomAD_exomes_non_ukb_FIN_AC|gnomAD_exomes_non_ukb_FIN_AF|gnomAD_exomes_non_ukb_FIN_AN|gnomAD_exomes_non_ukb_FIN_nhomalt|gnomAD_exomes_non_ukb_MID_AC|gnomAD_exomes_non_ukb_MID_AF|gnomAD_exomes_non_ukb_MID_AN|gnomAD_exomes_non_ukb_MID_nhomalt|gnomAD_exomes_non_ukb_NFE_AC|gnomAD_exomes_non_ukb_NFE_AF|gnomAD_exomes_non_ukb_NFE_AN|gnomAD_exomes_non_ukb_NFE_nhomalt|gnomAD_exomes_non_ukb_SAS_AC|gnomAD_exomes_non_ukb_SAS_AF|gnomAD_exomes_non_ukb_SAS_AN|gnomAD_exomes_non_ukb_SAS_nhomalt|gnomAD_exomes_non_ukb_nhomalt|gnomAD_genomes_AC|gnomAD_genomes_AF|gnomAD_genomes_AFR_AC|gnomAD_genomes_AFR_AF|gnomAD_genomes_AFR_AN|gnomAD_genomes_AFR_nhomalt|gnomAD_genomes_AMI_AC|gnomAD_genomes_AMI_AF|gnomAD_genomes_AMI_AN|gnomAD_genomes_AMI_nhomalt|gnomAD_genomes_AMR_AC|gnomAD_genomes_AMR_AF|gnomAD_genomes_AMR_AN|gnomAD_genomes_AMR_nhomalt|gnomAD_genomes_AN|gnomAD_genomes_ASJ_AC|gnomAD_genomes_ASJ_AF|gnomAD_genomes_ASJ_AN|gnomAD_genomes_ASJ_nhomalt|gnomAD_genomes_EAS_AC|gnomAD_genomes_EAS_AF|gnomAD_genomes_EAS_AN|gnomAD_genomes_EAS_nhomalt|gnomAD_genomes_FIN_AC|gnomAD_genomes_FIN_AF|gnomAD_genomes_FIN_AN|gnomAD_genomes_FIN_nhomalt|gnomAD_genomes_MID_AC|gnomAD_genomes_MID_AF|gnomAD_genomes_MID_AN|gnomAD_genomes_MID_nhomalt|gnomAD_genomes_NFE_AC|gnomAD_genomes_NFE_AF|gnomAD_genomes_NFE_AN|gnomAD_genomes_NFE_nhomalt|gnomAD_genomes_POPMAX_AC|gnomAD_genomes_POPMAX_AF|gnomAD_genomes_POPMAX_AN|gnomAD_genomes_POPMAX_nhomalt|gnomAD_genomes_SAS_AC|gnomAD_genomes_SAS_AF|gnomAD_genomes_SAS_AN|gnomAD_genomes_SAS_nhomalt|gnomAD_genomes_flag|gnomAD_genomes_nhomalt|hg18_chr|hg18_pos(1-based)|hg19_chr|hg19_pos(1-based)|integrated_confidence_value|integrated_fitCons_rankscore|integrated_fitCons_score|phastCons100way_vertebrate|phastCons100way_vertebrate_rankscore|phastCons17way_primate|phastCons17way_primate_rankscore|phastCons470way_mammalian|phastCons470way_mammalian_rankscore|phyloP100way_vertebrate|phyloP100way_vertebrate_rankscore|phyloP17way_primate|phyloP17way_primate_rankscore|phyloP470way_mammalian|phyloP470way_mammalian_rankscore|pos(1-based)|ref|refcodon|rs_dbSNP|MES-NCSS_downstream_acceptor|MES-NCSS_downstream_donor|MES-NCSS_upstream_acceptor|MES-NCSS_upstream_donor|MES-SWA_acceptor_alt|MES-SWA_acceptor_diff|MES-SWA_acceptor_ref|MES-SWA_acceptor_ref_comp|MES-SWA_donor_alt|MES-SWA_donor_diff|MES-SWA_donor_ref|MES-SWA_donor_ref_comp|MaxEntScan_alt|MaxEntScan_diff|MaxEntScan_ref|ada_score|rf_score|SpliceAI_pred_DP_AG|SpliceAI_pred_DP_AL|SpliceAI_pred_DP_DG|SpliceAI_pred_DP_DL|SpliceAI_pred_DS_AG|SpliceAI_pred_DS_AL|SpliceAI_pred_DS_DG|SpliceAI_pred_DS_DL|SpliceAI_pred_SYMBOL|HGMD_2020|HGMD_2020_brief|HGMD_2020_pmid|HGMD_2020_disease|HGMD_2020_hgvs|HGMD_2020_confidence|HGMD_2020_variant_type|clinvar_20250120|clinvar_20250120_ALLELEID|clinvar_20250120_CLNSIG|clinvar_20250120_CLNDN|clinvar_20250120_CLNDNINCL|clinvar_20250120_CLNDISDB|clinvar_20250120_CLNDISDBINCL|clinvar_20250120_CLNHGVS|clinvar_20250120_CLNREVSTAT|clinvar_20250120_CLNSIGCONF|clinvar_20250120_CLNSIGINCL|clinvar_20250120_CLNVC|clinvar_20250120_CLNVCSO|clinvar_20250120_CLNVI|clinvar_20250120_DBVARID|clinvar_20250120_ORIGIN|clinvar_20250120_RS|clinvar_20250120_SSR">

2.) now my idea is to use the annotation names (and not array numbers) for the code in ‘annotate_rows’. I am getting an array of annotation names since I will split the CSQ by “|”.
This is the best, chatGPT came up with :wink:

# importing pandas
import pandas as pd

# read text file into pandas DataFrame
#data = pd.read_csv("/mnt/users/haas/projects/hail/mt/lhp.2024-07-03_vep_anno.txt")

# Text data to be read
data = """Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|MANE_SELECT|MANE_PLUS_CLINICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|UNIPROT_ISOFORM|REFSEQ_MATCH|SOURCE|REFSEQ_OFFSET|GENE_PHENO|SIFT|PolyPhen|DOMAINS|miRNA|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|gnomADe_AF|gnomADe_AFR_AF|gnomADe_AMR_AF|gnomADe_ASJ_AF|gnomADe_EAS_AF|gnomADe_FIN_AF|gnomADe_NFE_AF|gnomADe_OTH_AF|gnomADe_SAS_AF|gnomADg_AF|gnomADg_AFR_AF|gnomADg_AMI_AF|gnomADg_AMR_AF|gnomADg_ASJ_AF|gnomADg_EAS_AF|gnomADg_FIN_AF|gnomADg_MID_AF|gnomADg_NFE_AF|gnomADg_OTH_AF|gnomADg_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS|1000Gp3_AC|1000Gp3_AF|1000Gp3_AFR_AC|1000Gp3_AFR_AF|1000Gp3_AMR_AC|1000Gp3_AMR_AF|1000Gp3_EAS_AC|1000Gp3_EAS_AF|1000Gp3_EUR_AC|1000Gp3_EUR_AF|1000Gp3_SAS_AC|1000Gp3_SAS_AF|ALFA_African_AC|ALFA_African_AF|ALFA_African_AN|ALFA_African_American_AC|ALFA_African_American_AF|ALFA_African_American_AN|ALFA_African_Others_AC|ALFA_African_Others_AF|ALFA_African_Others_AN|ALFA_Asian_AC|ALFA_Asian_AF|ALFA_Asian_AN|ALFA_East_Asian_AC|ALFA_East_Asian_AF|ALFA_East_Asian_AN|ALFA_European_AC|ALFA_European_AF|ALFA_European_AN|ALFA_Latin_American_1_AC|ALFA_Latin_American_1_AF|ALFA_Latin_American_1_AN|ALFA_Latin_American_2_AC|ALFA_Latin_American_2_AF|ALFA_Latin_American_2_AN|ALFA_Other_AC|ALFA_Other_AF|ALFA_Other_AN|ALFA_Other_Asian_AC|ALFA_Other_Asian_AF|ALFA_Other_Asian_AN|ALFA_South_Asian_AC|ALFA_South_Asian_AF|ALFA_South_Asian_AN|ALFA_Total_AC|ALFA_Total_AF|ALFA_Total_AN|ALSPAC_AC|ALSPAC_AF|APPRIS|Aloft_Confidence|Aloft_Fraction_transcripts_affected|Aloft_pred|Aloft_prob_Dominant|Aloft_prob_Recessive|Aloft_prob_Tolerant|AlphaMissense_pred|AlphaMissense_rankscore|AlphaMissense_score|AltaiNeandertal|Ancestral_allele|BayesDel_addAF_pred|BayesDel_addAF_rankscore|BayesDel_addAF_score|BayesDel_noAF_pred|BayesDel_noAF_rankscore|BayesDel_noAF_score|CADD_phred|CADD_phred_hg19|CADD_raw|CADD_raw_hg19|CADD_raw_rankscore|CADD_raw_rankscore_hg19|ChagyrskayaNeandertal|ClinPred_pred|ClinPred_rankscore|ClinPred_score|DANN_rankscore|DANN_score|DEOGEN2_pred|DEOGEN2_rankscore|DEOGEN2_score|Denisova|ESM1b_pred|ESM1b_rankscore|ESM1b_score|ESP6500_AA_AC|ESP6500_AA_AF|ESP6500_EA_AC|ESP6500_EA_AF|EVE_Class10_pred|EVE_Class20_pred|EVE_Class25_pred|EVE_Class30_pred|EVE_Class40_pred|EVE_Class50_pred|EVE_Class60_pred|EVE_Class70_pred|EVE_Class75_pred|EVE_Class80_pred|EVE_Class90_pred|EVE_rankscore|EVE_score|Eigen-PC-phred_coding|Eigen-PC-raw_coding|Eigen-PC-raw_coding_rankscore|Eigen-phred_coding|Eigen-raw_coding|Eigen-raw_coding_rankscore|Ensembl_geneid|Ensembl_proteinid|Ensembl_transcriptid|ExAC_AC|ExAC_AF|ExAC_AFR_AC|ExAC_AFR_AF|ExAC_AMR_AC|ExAC_AMR_AF|ExAC_Adj_AC|ExAC_Adj_AF|ExAC_EAS_AC|ExAC_EAS_AF|ExAC_FIN_AC|ExAC_FIN_AF|ExAC_NFE_AC|ExAC_NFE_AF|ExAC_SAS_AC|ExAC_SAS_AF|ExAC_nonTCGA_AC|ExAC_nonTCGA_AF|ExAC_nonTCGA_AFR_AC|ExAC_nonTCGA_AFR_AF|ExAC_nonTCGA_AMR_AC|ExAC_nonTCGA_AMR_AF|ExAC_nonTCGA_Adj_AC|ExAC_nonTCGA_Adj_AF|ExAC_nonTCGA_EAS_AC|ExAC_nonTCGA_EAS_AF|ExAC_nonTCGA_FIN_AC|ExAC_nonTCGA_FIN_AF|ExAC_nonTCGA_NFE_AC|ExAC_nonTCGA_NFE_AF|ExAC_nonTCGA_SAS_AC|ExAC_nonTCGA_SAS_AF|ExAC_nonpsych_AC|ExAC_nonpsych_AF|ExAC_nonpsych_AFR_AC|ExAC_nonpsych_AFR_AF|ExAC_nonpsych_AMR_AC|ExAC_nonpsych_AMR_AF|ExAC_nonpsych_Adj_AC|ExAC_nonpsych_Adj_AF|ExAC_nonpsych_EAS_AC|ExAC_nonpsych_EAS_AF|ExAC_nonpsych_FIN_AC|ExAC_nonpsych_FIN_AF|ExAC_nonpsych_NFE_AC|ExAC_nonpsych_NFE_AF|ExAC_nonpsych_SAS_AC|ExAC_nonpsych_SAS_AF|FATHMM_converted_rankscore|FATHMM_pred|FATHMM_score|GENCODE_basic|GERP++_NR|GERP++_RS|GERP++_RS_rankscore|GM12878_confidence_value|GM12878_fitCons_rankscore|GM12878_fitCons_score|GTEx_V8_eQTL_gene|GTEx_V8_eQTL_tissue|GTEx_V8_sQTL_gene|GTEx_V8_sQTL_tissue|GenoCanyon_rankscore|GenoCanyon_score|Geuvadis_eQTL_target_gene|H1-hESC_confidence_value|H1-hESC_fitCons_rankscore|H1-hESC_fitCons_score|HGVSc_ANNOVAR|HGVSc_VEP|HGVSc_snpEff|HGVSp_ANNOVAR|HGVSp_VEP|HGVSp_snpEff|HUVEC_confidence_value|HUVEC_fitCons_rankscore|HUVEC_fitCons_score|Interpro_domain|LINSIGHT|LINSIGHT_rankscore|LIST-S2_pred|LIST-S2_rankscore|LIST-S2_score|LRT_Omega|LRT_converted_rankscore|LRT_pred|LRT_score|M-CAP_pred|M-CAP_rankscore|M-CAP_score|MPC_rankscore|MPC_score|MVP_rankscore|MVP_score|MetaLR_pred|MetaLR_rankscore|MetaLR_score|MetaRNN_pred|MetaRNN_rankscore|MetaRNN_score|MetaSVM_pred|MetaSVM_rankscore|MetaSVM_score|MutPred_AAchange|MutPred_Top5features|MutPred_protID|MutPred_rankscore|MutPred_score|MutationAssessor_pred|MutationAssessor_rankscore|MutationAssessor_score|MutationTaster_AAE|MutationTaster_converted_rankscore|MutationTaster_model|MutationTaster_pred|MutationTaster_score|PROVEAN_converted_rankscore|PROVEAN_pred|PROVEAN_score|Polyphen2_HDIV_pred|Polyphen2_HDIV_rankscore|Polyphen2_HDIV_score|Polyphen2_HVAR_pred|Polyphen2_HVAR_rankscore|Polyphen2_HVAR_score|PrimateAI_pred|PrimateAI_rankscore|PrimateAI_score|REVEL_rankscore|REVEL_score|Reliability_index|SIFT4G_converted_rankscore|SIFT4G_pred|SIFT4G_score|SIFT_converted_rankscore|SIFT_pred|SIFT_score|SiPhy_29way_logOdds|SiPhy_29way_logOdds_rankscore|SiPhy_29way_pi|TSL|TWINSUK_AC|TWINSUK_AF|UK10K_AC|UK10K_AF|Uniprot_acc|Uniprot_entry|VARITY_ER_LOO_rankscore|VARITY_ER_LOO_score|VARITY_ER_rankscore|VARITY_ER_score|VARITY_R_LOO_rankscore|VARITY_R_LOO_score|VARITY_R_rankscore|VARITY_R_score|VEP_canonical|VEST4_rankscore|VEST4_score|VindijiaNeandertal|aaalt|aapos|aaref|alt|bStatistic|bStatistic_converted_rankscore|cds_strand|chr|clinvar_MedGen_id|clinvar_OMIM_id|clinvar_Orphanet_id|clinvar_clnsig|clinvar_hgvs|clinvar_id|clinvar_review|clinvar_trait|clinvar_var_source|codon_degeneracy|codonpos|eQTLGen_cis_or_trans|eQTLGen_gene_id|eQTLGen_gene_symbol|eQTLGen_snp_id|fathmm-MKL_coding_group|fathmm-MKL_coding_pred|fathmm-MKL_coding_rankscore|fathmm-MKL_coding_score|fathmm-XF_coding_pred|fathmm-XF_coding_rankscore|fathmm-XF_coding_score|gMVP_rankscore|gMVP_score|genename|gnomAD_exomes_AC|gnomAD_exomes_AF|gnomAD_exomes_AFR_AC|gnomAD_exomes_AFR_AF|gnomAD_exomes_AFR_AN|gnomAD_exomes_AFR_nhomalt|gnomAD_exomes_AMR_AC|gnomAD_exomes_AMR_AF|gnomAD_exomes_AMR_AN|gnomAD_exomes_AMR_nhomalt|gnomAD_exomes_AN|gnomAD_exomes_ASJ_AC|gnomAD_exomes_ASJ_AF|gnomAD_exomes_ASJ_AN|gnomAD_exomes_ASJ_nhomalt|gnomAD_exomes_EAS_AC|gnomAD_exomes_EAS_AF|gnomAD_exomes_EAS_AN|gnomAD_exomes_EAS_nhomalt|gnomAD_exomes_FIN_AC|gnomAD_exomes_FIN_AF|gnomAD_exomes_FIN_AN|gnomAD_exomes_FIN_nhomalt|gnomAD_exomes_MID_AC|gnomAD_exomes_MID_AF|gnomAD_exomes_MID_AN|gnomAD_exomes_MID_nhomalt|gnomAD_exomes_NFE_AC|gnomAD_exomes_NFE_AF|gnomAD_exomes_NFE_AN|gnomAD_exomes_NFE_nhomalt|gnomAD_exomes_POPMAX_AC|gnomAD_exomes_POPMAX_AF|gnomAD_exomes_POPMAX_AN|gnomAD_exomes_POPMAX_nhomalt|gnomAD_exomes_SAS_AC|gnomAD_exomes_SAS_AF|gnomAD_exomes_SAS_AN|gnomAD_exomes_SAS_nhomalt|gnomAD_exomes_flag|gnomAD_exomes_nhomalt|gnomAD_exomes_non_ukb_AC|gnomAD_exomes_non_ukb_AF|gnomAD_exomes_non_ukb_AFR_AC|gnomAD_exomes_non_ukb_AFR_AF|gnomAD_exomes_non_ukb_AFR_AN|gnomAD_exomes_non_ukb_AFR_nhomalt|gnomAD_exomes_non_ukb_AMR_AC|gnomAD_exomes_non_ukb_AMR_AF|gnomAD_exomes_non_ukb_AMR_AN|gnomAD_exomes_non_ukb_AMR_nhomalt|gnomAD_exomes_non_ukb_AN|gnomAD_exomes_non_ukb_ASJ_AC|gnomAD_exomes_non_ukb_ASJ_AF|gnomAD_exomes_non_ukb_ASJ_AN|gnomAD_exomes_non_ukb_ASJ_nhomalt|gnomAD_exomes_non_ukb_EAS_AC|gnomAD_exomes_non_ukb_EAS_AF|gnomAD_exomes_non_ukb_EAS_AN|gnomAD_exomes_non_ukb_EAS_nhomalt|gnomAD_exomes_non_ukb_FIN_AC|gnomAD_exomes_non_ukb_FIN_AF|gnomAD_exomes_non_ukb_FIN_AN|gnomAD_exomes_non_ukb_FIN_nhomalt|gnomAD_exomes_non_ukb_MID_AC|gnomAD_exomes_non_ukb_MID_AF|gnomAD_exomes_non_ukb_MID_AN|gnomAD_exomes_non_ukb_MID_nhomalt|gnomAD_exomes_non_ukb_NFE_AC|gnomAD_exomes_non_ukb_NFE_AF|gnomAD_exomes_non_ukb_NFE_AN|gnomAD_exomes_non_ukb_NFE_nhomalt|gnomAD_exomes_non_ukb_SAS_AC|gnomAD_exomes_non_ukb_SAS_AF|gnomAD_exomes_non_ukb_SAS_AN|gnomAD_exomes_non_ukb_SAS_nhomalt|gnomAD_exomes_non_ukb_nhomalt|gnomAD_genomes_AC|gnomAD_genomes_AF|gnomAD_genomes_AFR_AC|gnomAD_genomes_AFR_AF|gnomAD_genomes_AFR_AN|gnomAD_genomes_AFR_nhomalt|gnomAD_genomes_AMI_AC|gnomAD_genomes_AMI_AF|gnomAD_genomes_AMI_AN|gnomAD_genomes_AMI_nhomalt|gnomAD_genomes_AMR_AC|gnomAD_genomes_AMR_AF|gnomAD_genomes_AMR_AN|gnomAD_genomes_AMR_nhomalt|gnomAD_genomes_AN|gnomAD_genomes_ASJ_AC|gnomAD_genomes_ASJ_AF|gnomAD_genomes_ASJ_AN|gnomAD_genomes_ASJ_nhomalt|gnomAD_genomes_EAS_AC|gnomAD_genomes_EAS_AF|gnomAD_genomes_EAS_AN|gnomAD_genomes_EAS_nhomalt|gnomAD_genomes_FIN_AC|gnomAD_genomes_FIN_AF|gnomAD_genomes_FIN_AN|gnomAD_genomes_FIN_nhomalt|gnomAD_genomes_MID_AC|gnomAD_genomes_MID_AF|gnomAD_genomes_MID_AN|gnomAD_genomes_MID_nhomalt|gnomAD_genomes_NFE_AC|gnomAD_genomes_NFE_AF|gnomAD_genomes_NFE_AN|gnomAD_genomes_NFE_nhomalt|gnomAD_genomes_POPMAX_AC|gnomAD_genomes_POPMAX_AF|gnomAD_genomes_POPMAX_AN|gnomAD_genomes_POPMAX_nhomalt|gnomAD_genomes_SAS_AC|gnomAD_genomes_SAS_AF|gnomAD_genomes_SAS_AN|gnomAD_genomes_SAS_nhomalt|gnomAD_genomes_flag|gnomAD_genomes_nhomalt|hg18_chr|hg18_pos(1-based)|hg19_chr|hg19_pos(1-based)|integrated_confidence_value|integrated_fitCons_rankscore|integrated_fitCons_score|phastCons100way_vertebrate|phastCons100way_vertebrate_rankscore|phastCons17way_primate|phastCons17way_primate_rankscore|phastCons470way_mammalian|phastCons470way_mammalian_rankscore|phyloP100way_vertebrate|phyloP100way_vertebrate_rankscore|phyloP17way_primate|phyloP17way_primate_rankscore|phyloP470way_mammalian|phyloP470way_mammalian_rankscore|pos(1-based)|ref|refcodon|rs_dbSNP|MES-NCSS_downstream_acceptor|MES-NCSS_downstream_donor|MES-NCSS_upstream_acceptor|MES-NCSS_upstream_donor|MES-SWA_acceptor_alt|MES-SWA_acceptor_diff|MES-SWA_acceptor_ref|MES-SWA_acceptor_ref_comp|MES-SWA_donor_alt|MES-SWA_donor_diff|MES-SWA_donor_ref|MES-SWA_donor_ref_comp|MaxEntScan_alt|MaxEntScan_diff|MaxEntScan_ref|ada_score|rf_score|SpliceAI_pred_DP_AG|SpliceAI_pred_DP_AL|SpliceAI_pred_DP_DG|SpliceAI_pred_DP_DL|SpliceAI_pred_DS_AG|SpliceAI_pred_DS_AL|SpliceAI_pred_DS_DG|SpliceAI_pred_DS_DL|SpliceAI_pred_SYMBOL|HGMD_2020|HGMD_2020_brief|HGMD_2020_pmid|HGMD_2020_disease|HGMD_2020_hgvs|HGMD_2020_confidence|HGMD_2020_variant_type|clinvar_20250120|clinvar_20250120_ALLELEID|clinvar_20250120_CLNSIG|clinvar_20250120_CLNDN|clinvar_20250120_CLNDNINCL|clinvar_20250120_CLNDISDB|clinvar_20250120_CLNDISDBINCL|clinvar_20250120_CLNHGVS|clinvar_20250120_CLNREVSTAT|clinvar_20250120_CLNSIGCONF|clinvar_20250120_CLNSIGINCL|clinvar_20250120_CLNVC|clinvar_20250120_CLNVCSO|clinvar_20250120_CLNVI|clinvar_20250120_DBVARID|clinvar_20250120_ORIGIN|clinvar_20250120_RS|clinvar_20250120_SSR">"""

# Split the text by "|" and create a list of rows
rows = [line.split("|") for line in data.split("\n")]

# Create a DataFrame from the rows
df = pd.DataFrame(rows)

# Add a second column with row numbers
df["RowNumber"] = df.index + 1


# Transpose the DataFrame using the T attribute
transposed_df = df.T

# Reset the index to move row names into a regular column
df_reset = transposed_df.reset_index()

# Rename the column containing the row names to "name"
df_reset = df_reset.rename(columns={'index': 'index_count'})

# Rename the second column to "VEP"
df_reset = df_reset.rename(columns={df_reset.columns[1]: "VEP"})

# Using + operator to combine two columns into the column "annotations"
df_reset["annotations"] = df_reset['VEP'].astype(str) +" = "+ df_reset["index_count"].astype(str)

# Filter rows based on values in the "VEP" column using the vector of annotations
annotations = ['IMPACT', 'Consequence','SYMBOL','Gene','CANONICAL','MANE_SELECT','MANE_PLUS_CLINICAL','HGVSc','HGVSp','CADD_phred','CADD_raw','CADD_raw_rankscore',
           'gnomADe_AF','gnomADe_NFE_AF','gnomADg_AF','gnomADg_NFE_AF',
           'SpliceAI_pred_DP_AG','SpliceAI_pred_DP_AL','SpliceAI_pred_DP_DG','SpliceAI_pred_DP_DL','SpliceAI_pred_DS_AG','SpliceAI_pred_DS_AL','SpliceAI_pred_DS_DG','SpliceAI_pred_DS_DL','SpliceAI_pred_SYMBOL','AlphaMissense_pred','AlphaMissense_rankscore','AlphaMissense_score','REVEL_rankscore','REVEL_score','Clinvar_CLNSIG','Clinvar_CLNSIGCONF','clinvar_clnsig','REVEL_rankscore','REVEL_score','Reliability_index','MaxEntScan_alt','MaxEntScan_diff','MaxEntScan_ref']

filtered_df = df_reset[df_reset['VEP'].isin(annotations)]

# Get the values of the 'annotations' column and print without index numbers
for value in filtered_df['annotations']:
    print(value)

3.) now I can define variables of interest:

Consequence = 1
IMPACT = 2
SYMBOL = 3
Gene = 4
HGVSc = 10
HGVSp = 11

and then split the CSQ info from VEP:

split_csq = mt.info.CSQ
split_csq = split_csq.map(lambda x: x.split('\\|'))

mt = mt.annotate_rows(split_csq = split_csq)

mt = mt.explode_rows(mt.split_csq)

4.) and finally do the annotate_rows:

mt = mt.annotate_rows(
    IMPACT = mt.split_csq[IMPACT],
    Consequence = mt.split_csq[Consequence],
    SYMBOL = mt.split_csq[SYMBOL],
    Gene = mt.split_csq[Gene],
    HGVSc = mt.split_csq[HGVSc],
    HGVSp = mt.split_csq[HGVSp]
)

Any hint on also improving this part is highly appreciated.

Thanks, I think I’m mostly following now. But I’m still unclear on what the end goal is. You say

I’m not sure what format the gnomad utils expect. Could you elaborate on how you want the data to look, maybe using the example from your first post?

I’m also a bit confused by the last steps in your last post. You explode on mt.split_csq, so after that split_csq will be a single string, not an array any more. But then in step 4 you are indexing into split_csq. I think maybe you don’t want to explode, but I can be of more help if you explain what you want in the end.

Main goal is to reduce the table from the first post to only one row based on certain conditions, so that variants are unique. Probably creating a new variantID in the format chr:pos:ref:alt would be best to be specific. Then I would need a condition that selects the most severe consequence for this variantID based on a certain order e.g. non_coding_transcript_exon_variant would be first choice, synonymous_variant second and upstream_gene_variant third. Still this could leed to multiple transcripts with the same consequence. In this case I have another row-field ‘score’ which I could use for prioritization. Could that somehow work with ‘group_by’ (variant ID)?

There are still a couple of things I don’t understand from your post with the code snippets. In 3), what is mt? Can you share the code that defines it? Is it just importing a VCF? Or is it annotating an existing matrix table with an annotation in a VCF? I’m trying to understand how you came to have multiple rows with the same locus/alleles in the first place.

And second, did you actually run 4)? It looks to me like that would error, since after mt = mt.explode_rows(mt.split_csq), the mt.split_csq field is no longer an array, so indexing into it like mt.split_csq[IMPACT] would be an error. Given that, I can’t tell what those steps are meant to accomplish, or if this is the cause of the duplicated variants.