Thanks for the quick response. Maybe I share it from the beginning. It might not be the most efficient or clean code, but for getting the data into hail it works fine:
1.) I make sure to get the right structure of the VEP annotation:
!cat myvcf | grep "##" | grep "CSQ" > myvcf.header.txt
So I am getting something like this in the “myvcf.header.txt”:
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|MANE_SELECT|MANE_PLUS_CLINICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|UNIPROT_ISOFORM|REFSEQ_MATCH|SOURCE|REFSEQ_OFFSET|GENE_PHENO|SIFT|PolyPhen|DOMAINS|miRNA|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|gnomADe_AF|gnomADe_AFR_AF|gnomADe_AMR_AF|gnomADe_ASJ_AF|gnomADe_EAS_AF|gnomADe_FIN_AF|gnomADe_NFE_AF|gnomADe_OTH_AF|gnomADe_SAS_AF|gnomADg_AF|gnomADg_AFR_AF|gnomADg_AMI_AF|gnomADg_AMR_AF|gnomADg_ASJ_AF|gnomADg_EAS_AF|gnomADg_FIN_AF|gnomADg_MID_AF|gnomADg_NFE_AF|gnomADg_OTH_AF|gnomADg_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS|1000Gp3_AC|1000Gp3_AF|1000Gp3_AFR_AC|1000Gp3_AFR_AF|1000Gp3_AMR_AC|1000Gp3_AMR_AF|1000Gp3_EAS_AC|1000Gp3_EAS_AF|1000Gp3_EUR_AC|1000Gp3_EUR_AF|1000Gp3_SAS_AC|1000Gp3_SAS_AF|ALFA_African_AC|ALFA_African_AF|ALFA_African_AN|ALFA_African_American_AC|ALFA_African_American_AF|ALFA_African_American_AN|ALFA_African_Others_AC|ALFA_African_Others_AF|ALFA_African_Others_AN|ALFA_Asian_AC|ALFA_Asian_AF|ALFA_Asian_AN|ALFA_East_Asian_AC|ALFA_East_Asian_AF|ALFA_East_Asian_AN|ALFA_European_AC|ALFA_European_AF|ALFA_European_AN|ALFA_Latin_American_1_AC|ALFA_Latin_American_1_AF|ALFA_Latin_American_1_AN|ALFA_Latin_American_2_AC|ALFA_Latin_American_2_AF|ALFA_Latin_American_2_AN|ALFA_Other_AC|ALFA_Other_AF|ALFA_Other_AN|ALFA_Other_Asian_AC|ALFA_Other_Asian_AF|ALFA_Other_Asian_AN|ALFA_South_Asian_AC|ALFA_South_Asian_AF|ALFA_South_Asian_AN|ALFA_Total_AC|ALFA_Total_AF|ALFA_Total_AN|ALSPAC_AC|ALSPAC_AF|APPRIS|Aloft_Confidence|Aloft_Fraction_transcripts_affected|Aloft_pred|Aloft_prob_Dominant|Aloft_prob_Recessive|Aloft_prob_Tolerant|AlphaMissense_pred|AlphaMissense_rankscore|AlphaMissense_score|AltaiNeandertal|Ancestral_allele|BayesDel_addAF_pred|BayesDel_addAF_rankscore|BayesDel_addAF_score|BayesDel_noAF_pred|BayesDel_noAF_rankscore|BayesDel_noAF_score|CADD_phred|CADD_phred_hg19|CADD_raw|CADD_raw_hg19|CADD_raw_rankscore|CADD_raw_rankscore_hg19|ChagyrskayaNeandertal|ClinPred_pred|ClinPred_rankscore|ClinPred_score|DANN_rankscore|DANN_score|DEOGEN2_pred|DEOGEN2_rankscore|DEOGEN2_score|Denisova|ESM1b_pred|ESM1b_rankscore|ESM1b_score|ESP6500_AA_AC|ESP6500_AA_AF|ESP6500_EA_AC|ESP6500_EA_AF|EVE_Class10_pred|EVE_Class20_pred|EVE_Class25_pred|EVE_Class30_pred|EVE_Class40_pred|EVE_Class50_pred|EVE_Class60_pred|EVE_Class70_pred|EVE_Class75_pred|EVE_Class80_pred|EVE_Class90_pred|EVE_rankscore|EVE_score|Eigen-PC-phred_coding|Eigen-PC-raw_coding|Eigen-PC-raw_coding_rankscore|Eigen-phred_coding|Eigen-raw_coding|Eigen-raw_coding_rankscore|Ensembl_geneid|Ensembl_proteinid|Ensembl_transcriptid|ExAC_AC|ExAC_AF|ExAC_AFR_AC|ExAC_AFR_AF|ExAC_AMR_AC|ExAC_AMR_AF|ExAC_Adj_AC|ExAC_Adj_AF|ExAC_EAS_AC|ExAC_EAS_AF|ExAC_FIN_AC|ExAC_FIN_AF|ExAC_NFE_AC|ExAC_NFE_AF|ExAC_SAS_AC|ExAC_SAS_AF|ExAC_nonTCGA_AC|ExAC_nonTCGA_AF|ExAC_nonTCGA_AFR_AC|ExAC_nonTCGA_AFR_AF|ExAC_nonTCGA_AMR_AC|ExAC_nonTCGA_AMR_AF|ExAC_nonTCGA_Adj_AC|ExAC_nonTCGA_Adj_AF|ExAC_nonTCGA_EAS_AC|ExAC_nonTCGA_EAS_AF|ExAC_nonTCGA_FIN_AC|ExAC_nonTCGA_FIN_AF|ExAC_nonTCGA_NFE_AC|ExAC_nonTCGA_NFE_AF|ExAC_nonTCGA_SAS_AC|ExAC_nonTCGA_SAS_AF|ExAC_nonpsych_AC|ExAC_nonpsych_AF|ExAC_nonpsych_AFR_AC|ExAC_nonpsych_AFR_AF|ExAC_nonpsych_AMR_AC|ExAC_nonpsych_AMR_AF|ExAC_nonpsych_Adj_AC|ExAC_nonpsych_Adj_AF|ExAC_nonpsych_EAS_AC|ExAC_nonpsych_EAS_AF|ExAC_nonpsych_FIN_AC|ExAC_nonpsych_FIN_AF|ExAC_nonpsych_NFE_AC|ExAC_nonpsych_NFE_AF|ExAC_nonpsych_SAS_AC|ExAC_nonpsych_SAS_AF|FATHMM_converted_rankscore|FATHMM_pred|FATHMM_score|GENCODE_basic|GERP++_NR|GERP++_RS|GERP++_RS_rankscore|GM12878_confidence_value|GM12878_fitCons_rankscore|GM12878_fitCons_score|GTEx_V8_eQTL_gene|GTEx_V8_eQTL_tissue|GTEx_V8_sQTL_gene|GTEx_V8_sQTL_tissue|GenoCanyon_rankscore|GenoCanyon_score|Geuvadis_eQTL_target_gene|H1-hESC_confidence_value|H1-hESC_fitCons_rankscore|H1-hESC_fitCons_score|HGVSc_ANNOVAR|HGVSc_VEP|HGVSc_snpEff|HGVSp_ANNOVAR|HGVSp_VEP|HGVSp_snpEff|HUVEC_confidence_value|HUVEC_fitCons_rankscore|HUVEC_fitCons_score|Interpro_domain|LINSIGHT|LINSIGHT_rankscore|LIST-S2_pred|LIST-S2_rankscore|LIST-S2_score|LRT_Omega|LRT_converted_rankscore|LRT_pred|LRT_score|M-CAP_pred|M-CAP_rankscore|M-CAP_score|MPC_rankscore|MPC_score|MVP_rankscore|MVP_score|MetaLR_pred|MetaLR_rankscore|MetaLR_score|MetaRNN_pred|MetaRNN_rankscore|MetaRNN_score|MetaSVM_pred|MetaSVM_rankscore|MetaSVM_score|MutPred_AAchange|MutPred_Top5features|MutPred_protID|MutPred_rankscore|MutPred_score|MutationAssessor_pred|MutationAssessor_rankscore|MutationAssessor_score|MutationTaster_AAE|MutationTaster_converted_rankscore|MutationTaster_model|MutationTaster_pred|MutationTaster_score|PROVEAN_converted_rankscore|PROVEAN_pred|PROVEAN_score|Polyphen2_HDIV_pred|Polyphen2_HDIV_rankscore|Polyphen2_HDIV_score|Polyphen2_HVAR_pred|Polyphen2_HVAR_rankscore|Polyphen2_HVAR_score|PrimateAI_pred|PrimateAI_rankscore|PrimateAI_score|REVEL_rankscore|REVEL_score|Reliability_index|SIFT4G_converted_rankscore|SIFT4G_pred|SIFT4G_score|SIFT_converted_rankscore|SIFT_pred|SIFT_score|SiPhy_29way_logOdds|SiPhy_29way_logOdds_rankscore|SiPhy_29way_pi|TSL|TWINSUK_AC|TWINSUK_AF|UK10K_AC|UK10K_AF|Uniprot_acc|Uniprot_entry|VARITY_ER_LOO_rankscore|VARITY_ER_LOO_score|VARITY_ER_rankscore|VARITY_ER_score|VARITY_R_LOO_rankscore|VARITY_R_LOO_score|VARITY_R_rankscore|VARITY_R_score|VEP_canonical|VEST4_rankscore|VEST4_score|VindijiaNeandertal|aaalt|aapos|aaref|alt|bStatistic|bStatistic_converted_rankscore|cds_strand|chr|clinvar_MedGen_id|clinvar_OMIM_id|clinvar_Orphanet_id|clinvar_clnsig|clinvar_hgvs|clinvar_id|clinvar_review|clinvar_trait|clinvar_var_source|codon_degeneracy|codonpos|eQTLGen_cis_or_trans|eQTLGen_gene_id|eQTLGen_gene_symbol|eQTLGen_snp_id|fathmm-MKL_coding_group|fathmm-MKL_coding_pred|fathmm-MKL_coding_rankscore|fathmm-MKL_coding_score|fathmm-XF_coding_pred|fathmm-XF_coding_rankscore|fathmm-XF_coding_score|gMVP_rankscore|gMVP_score|genename|gnomAD_exomes_AC|gnomAD_exomes_AF|gnomAD_exomes_AFR_AC|gnomAD_exomes_AFR_AF|gnomAD_exomes_AFR_AN|gnomAD_exomes_AFR_nhomalt|gnomAD_exomes_AMR_AC|gnomAD_exomes_AMR_AF|gnomAD_exomes_AMR_AN|gnomAD_exomes_AMR_nhomalt|gnomAD_exomes_AN|gnomAD_exomes_ASJ_AC|gnomAD_exomes_ASJ_AF|gnomAD_exomes_ASJ_AN|gnomAD_exomes_ASJ_nhomalt|gnomAD_exomes_EAS_AC|gnomAD_exomes_EAS_AF|gnomAD_exomes_EAS_AN|gnomAD_exomes_EAS_nhomalt|gnomAD_exomes_FIN_AC|gnomAD_exomes_FIN_AF|gnomAD_exomes_FIN_AN|gnomAD_exomes_FIN_nhomalt|gnomAD_exomes_MID_AC|gnomAD_exomes_MID_AF|gnomAD_exomes_MID_AN|gnomAD_exomes_MID_nhomalt|gnomAD_exomes_NFE_AC|gnomAD_exomes_NFE_AF|gnomAD_exomes_NFE_AN|gnomAD_exomes_NFE_nhomalt|gnomAD_exomes_POPMAX_AC|gnomAD_exomes_POPMAX_AF|gnomAD_exomes_POPMAX_AN|gnomAD_exomes_POPMAX_nhomalt|gnomAD_exomes_SAS_AC|gnomAD_exomes_SAS_AF|gnomAD_exomes_SAS_AN|gnomAD_exomes_SAS_nhomalt|gnomAD_exomes_flag|gnomAD_exomes_nhomalt|gnomAD_exomes_non_ukb_AC|gnomAD_exomes_non_ukb_AF|gnomAD_exomes_non_ukb_AFR_AC|gnomAD_exomes_non_ukb_AFR_AF|gnomAD_exomes_non_ukb_AFR_AN|gnomAD_exomes_non_ukb_AFR_nhomalt|gnomAD_exomes_non_ukb_AMR_AC|gnomAD_exomes_non_ukb_AMR_AF|gnomAD_exomes_non_ukb_AMR_AN|gnomAD_exomes_non_ukb_AMR_nhomalt|gnomAD_exomes_non_ukb_AN|gnomAD_exomes_non_ukb_ASJ_AC|gnomAD_exomes_non_ukb_ASJ_AF|gnomAD_exomes_non_ukb_ASJ_AN|gnomAD_exomes_non_ukb_ASJ_nhomalt|gnomAD_exomes_non_ukb_EAS_AC|gnomAD_exomes_non_ukb_EAS_AF|gnomAD_exomes_non_ukb_EAS_AN|gnomAD_exomes_non_ukb_EAS_nhomalt|gnomAD_exomes_non_ukb_FIN_AC|gnomAD_exomes_non_ukb_FIN_AF|gnomAD_exomes_non_ukb_FIN_AN|gnomAD_exomes_non_ukb_FIN_nhomalt|gnomAD_exomes_non_ukb_MID_AC|gnomAD_exomes_non_ukb_MID_AF|gnomAD_exomes_non_ukb_MID_AN|gnomAD_exomes_non_ukb_MID_nhomalt|gnomAD_exomes_non_ukb_NFE_AC|gnomAD_exomes_non_ukb_NFE_AF|gnomAD_exomes_non_ukb_NFE_AN|gnomAD_exomes_non_ukb_NFE_nhomalt|gnomAD_exomes_non_ukb_SAS_AC|gnomAD_exomes_non_ukb_SAS_AF|gnomAD_exomes_non_ukb_SAS_AN|gnomAD_exomes_non_ukb_SAS_nhomalt|gnomAD_exomes_non_ukb_nhomalt|gnomAD_genomes_AC|gnomAD_genomes_AF|gnomAD_genomes_AFR_AC|gnomAD_genomes_AFR_AF|gnomAD_genomes_AFR_AN|gnomAD_genomes_AFR_nhomalt|gnomAD_genomes_AMI_AC|gnomAD_genomes_AMI_AF|gnomAD_genomes_AMI_AN|gnomAD_genomes_AMI_nhomalt|gnomAD_genomes_AMR_AC|gnomAD_genomes_AMR_AF|gnomAD_genomes_AMR_AN|gnomAD_genomes_AMR_nhomalt|gnomAD_genomes_AN|gnomAD_genomes_ASJ_AC|gnomAD_genomes_ASJ_AF|gnomAD_genomes_ASJ_AN|gnomAD_genomes_ASJ_nhomalt|gnomAD_genomes_EAS_AC|gnomAD_genomes_EAS_AF|gnomAD_genomes_EAS_AN|gnomAD_genomes_EAS_nhomalt|gnomAD_genomes_FIN_AC|gnomAD_genomes_FIN_AF|gnomAD_genomes_FIN_AN|gnomAD_genomes_FIN_nhomalt|gnomAD_genomes_MID_AC|gnomAD_genomes_MID_AF|gnomAD_genomes_MID_AN|gnomAD_genomes_MID_nhomalt|gnomAD_genomes_NFE_AC|gnomAD_genomes_NFE_AF|gnomAD_genomes_NFE_AN|gnomAD_genomes_NFE_nhomalt|gnomAD_genomes_POPMAX_AC|gnomAD_genomes_POPMAX_AF|gnomAD_genomes_POPMAX_AN|gnomAD_genomes_POPMAX_nhomalt|gnomAD_genomes_SAS_AC|gnomAD_genomes_SAS_AF|gnomAD_genomes_SAS_AN|gnomAD_genomes_SAS_nhomalt|gnomAD_genomes_flag|gnomAD_genomes_nhomalt|hg18_chr|hg18_pos(1-based)|hg19_chr|hg19_pos(1-based)|integrated_confidence_value|integrated_fitCons_rankscore|integrated_fitCons_score|phastCons100way_vertebrate|phastCons100way_vertebrate_rankscore|phastCons17way_primate|phastCons17way_primate_rankscore|phastCons470way_mammalian|phastCons470way_mammalian_rankscore|phyloP100way_vertebrate|phyloP100way_vertebrate_rankscore|phyloP17way_primate|phyloP17way_primate_rankscore|phyloP470way_mammalian|phyloP470way_mammalian_rankscore|pos(1-based)|ref|refcodon|rs_dbSNP|MES-NCSS_downstream_acceptor|MES-NCSS_downstream_donor|MES-NCSS_upstream_acceptor|MES-NCSS_upstream_donor|MES-SWA_acceptor_alt|MES-SWA_acceptor_diff|MES-SWA_acceptor_ref|MES-SWA_acceptor_ref_comp|MES-SWA_donor_alt|MES-SWA_donor_diff|MES-SWA_donor_ref|MES-SWA_donor_ref_comp|MaxEntScan_alt|MaxEntScan_diff|MaxEntScan_ref|ada_score|rf_score|SpliceAI_pred_DP_AG|SpliceAI_pred_DP_AL|SpliceAI_pred_DP_DG|SpliceAI_pred_DP_DL|SpliceAI_pred_DS_AG|SpliceAI_pred_DS_AL|SpliceAI_pred_DS_DG|SpliceAI_pred_DS_DL|SpliceAI_pred_SYMBOL|HGMD_2020|HGMD_2020_brief|HGMD_2020_pmid|HGMD_2020_disease|HGMD_2020_hgvs|HGMD_2020_confidence|HGMD_2020_variant_type|clinvar_20250120|clinvar_20250120_ALLELEID|clinvar_20250120_CLNSIG|clinvar_20250120_CLNDN|clinvar_20250120_CLNDNINCL|clinvar_20250120_CLNDISDB|clinvar_20250120_CLNDISDBINCL|clinvar_20250120_CLNHGVS|clinvar_20250120_CLNREVSTAT|clinvar_20250120_CLNSIGCONF|clinvar_20250120_CLNSIGINCL|clinvar_20250120_CLNVC|clinvar_20250120_CLNVCSO|clinvar_20250120_CLNVI|clinvar_20250120_DBVARID|clinvar_20250120_ORIGIN|clinvar_20250120_RS|clinvar_20250120_SSR">
2.) now my idea is to use the annotation names (and not array numbers) for the code in ‘annotate_rows’. I am getting an array of annotation names since I will split the CSQ by “|”.
This is the best, chatGPT came up with 
# importing pandas
import pandas as pd
# read text file into pandas DataFrame
#data = pd.read_csv("/mnt/users/haas/projects/hail/mt/lhp.2024-07-03_vep_anno.txt")
# Text data to be read
data = """Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|MANE_SELECT|MANE_PLUS_CLINICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|UNIPROT_ISOFORM|REFSEQ_MATCH|SOURCE|REFSEQ_OFFSET|GENE_PHENO|SIFT|PolyPhen|DOMAINS|miRNA|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|gnomADe_AF|gnomADe_AFR_AF|gnomADe_AMR_AF|gnomADe_ASJ_AF|gnomADe_EAS_AF|gnomADe_FIN_AF|gnomADe_NFE_AF|gnomADe_OTH_AF|gnomADe_SAS_AF|gnomADg_AF|gnomADg_AFR_AF|gnomADg_AMI_AF|gnomADg_AMR_AF|gnomADg_ASJ_AF|gnomADg_EAS_AF|gnomADg_FIN_AF|gnomADg_MID_AF|gnomADg_NFE_AF|gnomADg_OTH_AF|gnomADg_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS|1000Gp3_AC|1000Gp3_AF|1000Gp3_AFR_AC|1000Gp3_AFR_AF|1000Gp3_AMR_AC|1000Gp3_AMR_AF|1000Gp3_EAS_AC|1000Gp3_EAS_AF|1000Gp3_EUR_AC|1000Gp3_EUR_AF|1000Gp3_SAS_AC|1000Gp3_SAS_AF|ALFA_African_AC|ALFA_African_AF|ALFA_African_AN|ALFA_African_American_AC|ALFA_African_American_AF|ALFA_African_American_AN|ALFA_African_Others_AC|ALFA_African_Others_AF|ALFA_African_Others_AN|ALFA_Asian_AC|ALFA_Asian_AF|ALFA_Asian_AN|ALFA_East_Asian_AC|ALFA_East_Asian_AF|ALFA_East_Asian_AN|ALFA_European_AC|ALFA_European_AF|ALFA_European_AN|ALFA_Latin_American_1_AC|ALFA_Latin_American_1_AF|ALFA_Latin_American_1_AN|ALFA_Latin_American_2_AC|ALFA_Latin_American_2_AF|ALFA_Latin_American_2_AN|ALFA_Other_AC|ALFA_Other_AF|ALFA_Other_AN|ALFA_Other_Asian_AC|ALFA_Other_Asian_AF|ALFA_Other_Asian_AN|ALFA_South_Asian_AC|ALFA_South_Asian_AF|ALFA_South_Asian_AN|ALFA_Total_AC|ALFA_Total_AF|ALFA_Total_AN|ALSPAC_AC|ALSPAC_AF|APPRIS|Aloft_Confidence|Aloft_Fraction_transcripts_affected|Aloft_pred|Aloft_prob_Dominant|Aloft_prob_Recessive|Aloft_prob_Tolerant|AlphaMissense_pred|AlphaMissense_rankscore|AlphaMissense_score|AltaiNeandertal|Ancestral_allele|BayesDel_addAF_pred|BayesDel_addAF_rankscore|BayesDel_addAF_score|BayesDel_noAF_pred|BayesDel_noAF_rankscore|BayesDel_noAF_score|CADD_phred|CADD_phred_hg19|CADD_raw|CADD_raw_hg19|CADD_raw_rankscore|CADD_raw_rankscore_hg19|ChagyrskayaNeandertal|ClinPred_pred|ClinPred_rankscore|ClinPred_score|DANN_rankscore|DANN_score|DEOGEN2_pred|DEOGEN2_rankscore|DEOGEN2_score|Denisova|ESM1b_pred|ESM1b_rankscore|ESM1b_score|ESP6500_AA_AC|ESP6500_AA_AF|ESP6500_EA_AC|ESP6500_EA_AF|EVE_Class10_pred|EVE_Class20_pred|EVE_Class25_pred|EVE_Class30_pred|EVE_Class40_pred|EVE_Class50_pred|EVE_Class60_pred|EVE_Class70_pred|EVE_Class75_pred|EVE_Class80_pred|EVE_Class90_pred|EVE_rankscore|EVE_score|Eigen-PC-phred_coding|Eigen-PC-raw_coding|Eigen-PC-raw_coding_rankscore|Eigen-phred_coding|Eigen-raw_coding|Eigen-raw_coding_rankscore|Ensembl_geneid|Ensembl_proteinid|Ensembl_transcriptid|ExAC_AC|ExAC_AF|ExAC_AFR_AC|ExAC_AFR_AF|ExAC_AMR_AC|ExAC_AMR_AF|ExAC_Adj_AC|ExAC_Adj_AF|ExAC_EAS_AC|ExAC_EAS_AF|ExAC_FIN_AC|ExAC_FIN_AF|ExAC_NFE_AC|ExAC_NFE_AF|ExAC_SAS_AC|ExAC_SAS_AF|ExAC_nonTCGA_AC|ExAC_nonTCGA_AF|ExAC_nonTCGA_AFR_AC|ExAC_nonTCGA_AFR_AF|ExAC_nonTCGA_AMR_AC|ExAC_nonTCGA_AMR_AF|ExAC_nonTCGA_Adj_AC|ExAC_nonTCGA_Adj_AF|ExAC_nonTCGA_EAS_AC|ExAC_nonTCGA_EAS_AF|ExAC_nonTCGA_FIN_AC|ExAC_nonTCGA_FIN_AF|ExAC_nonTCGA_NFE_AC|ExAC_nonTCGA_NFE_AF|ExAC_nonTCGA_SAS_AC|ExAC_nonTCGA_SAS_AF|ExAC_nonpsych_AC|ExAC_nonpsych_AF|ExAC_nonpsych_AFR_AC|ExAC_nonpsych_AFR_AF|ExAC_nonpsych_AMR_AC|ExAC_nonpsych_AMR_AF|ExAC_nonpsych_Adj_AC|ExAC_nonpsych_Adj_AF|ExAC_nonpsych_EAS_AC|ExAC_nonpsych_EAS_AF|ExAC_nonpsych_FIN_AC|ExAC_nonpsych_FIN_AF|ExAC_nonpsych_NFE_AC|ExAC_nonpsych_NFE_AF|ExAC_nonpsych_SAS_AC|ExAC_nonpsych_SAS_AF|FATHMM_converted_rankscore|FATHMM_pred|FATHMM_score|GENCODE_basic|GERP++_NR|GERP++_RS|GERP++_RS_rankscore|GM12878_confidence_value|GM12878_fitCons_rankscore|GM12878_fitCons_score|GTEx_V8_eQTL_gene|GTEx_V8_eQTL_tissue|GTEx_V8_sQTL_gene|GTEx_V8_sQTL_tissue|GenoCanyon_rankscore|GenoCanyon_score|Geuvadis_eQTL_target_gene|H1-hESC_confidence_value|H1-hESC_fitCons_rankscore|H1-hESC_fitCons_score|HGVSc_ANNOVAR|HGVSc_VEP|HGVSc_snpEff|HGVSp_ANNOVAR|HGVSp_VEP|HGVSp_snpEff|HUVEC_confidence_value|HUVEC_fitCons_rankscore|HUVEC_fitCons_score|Interpro_domain|LINSIGHT|LINSIGHT_rankscore|LIST-S2_pred|LIST-S2_rankscore|LIST-S2_score|LRT_Omega|LRT_converted_rankscore|LRT_pred|LRT_score|M-CAP_pred|M-CAP_rankscore|M-CAP_score|MPC_rankscore|MPC_score|MVP_rankscore|MVP_score|MetaLR_pred|MetaLR_rankscore|MetaLR_score|MetaRNN_pred|MetaRNN_rankscore|MetaRNN_score|MetaSVM_pred|MetaSVM_rankscore|MetaSVM_score|MutPred_AAchange|MutPred_Top5features|MutPred_protID|MutPred_rankscore|MutPred_score|MutationAssessor_pred|MutationAssessor_rankscore|MutationAssessor_score|MutationTaster_AAE|MutationTaster_converted_rankscore|MutationTaster_model|MutationTaster_pred|MutationTaster_score|PROVEAN_converted_rankscore|PROVEAN_pred|PROVEAN_score|Polyphen2_HDIV_pred|Polyphen2_HDIV_rankscore|Polyphen2_HDIV_score|Polyphen2_HVAR_pred|Polyphen2_HVAR_rankscore|Polyphen2_HVAR_score|PrimateAI_pred|PrimateAI_rankscore|PrimateAI_score|REVEL_rankscore|REVEL_score|Reliability_index|SIFT4G_converted_rankscore|SIFT4G_pred|SIFT4G_score|SIFT_converted_rankscore|SIFT_pred|SIFT_score|SiPhy_29way_logOdds|SiPhy_29way_logOdds_rankscore|SiPhy_29way_pi|TSL|TWINSUK_AC|TWINSUK_AF|UK10K_AC|UK10K_AF|Uniprot_acc|Uniprot_entry|VARITY_ER_LOO_rankscore|VARITY_ER_LOO_score|VARITY_ER_rankscore|VARITY_ER_score|VARITY_R_LOO_rankscore|VARITY_R_LOO_score|VARITY_R_rankscore|VARITY_R_score|VEP_canonical|VEST4_rankscore|VEST4_score|VindijiaNeandertal|aaalt|aapos|aaref|alt|bStatistic|bStatistic_converted_rankscore|cds_strand|chr|clinvar_MedGen_id|clinvar_OMIM_id|clinvar_Orphanet_id|clinvar_clnsig|clinvar_hgvs|clinvar_id|clinvar_review|clinvar_trait|clinvar_var_source|codon_degeneracy|codonpos|eQTLGen_cis_or_trans|eQTLGen_gene_id|eQTLGen_gene_symbol|eQTLGen_snp_id|fathmm-MKL_coding_group|fathmm-MKL_coding_pred|fathmm-MKL_coding_rankscore|fathmm-MKL_coding_score|fathmm-XF_coding_pred|fathmm-XF_coding_rankscore|fathmm-XF_coding_score|gMVP_rankscore|gMVP_score|genename|gnomAD_exomes_AC|gnomAD_exomes_AF|gnomAD_exomes_AFR_AC|gnomAD_exomes_AFR_AF|gnomAD_exomes_AFR_AN|gnomAD_exomes_AFR_nhomalt|gnomAD_exomes_AMR_AC|gnomAD_exomes_AMR_AF|gnomAD_exomes_AMR_AN|gnomAD_exomes_AMR_nhomalt|gnomAD_exomes_AN|gnomAD_exomes_ASJ_AC|gnomAD_exomes_ASJ_AF|gnomAD_exomes_ASJ_AN|gnomAD_exomes_ASJ_nhomalt|gnomAD_exomes_EAS_AC|gnomAD_exomes_EAS_AF|gnomAD_exomes_EAS_AN|gnomAD_exomes_EAS_nhomalt|gnomAD_exomes_FIN_AC|gnomAD_exomes_FIN_AF|gnomAD_exomes_FIN_AN|gnomAD_exomes_FIN_nhomalt|gnomAD_exomes_MID_AC|gnomAD_exomes_MID_AF|gnomAD_exomes_MID_AN|gnomAD_exomes_MID_nhomalt|gnomAD_exomes_NFE_AC|gnomAD_exomes_NFE_AF|gnomAD_exomes_NFE_AN|gnomAD_exomes_NFE_nhomalt|gnomAD_exomes_POPMAX_AC|gnomAD_exomes_POPMAX_AF|gnomAD_exomes_POPMAX_AN|gnomAD_exomes_POPMAX_nhomalt|gnomAD_exomes_SAS_AC|gnomAD_exomes_SAS_AF|gnomAD_exomes_SAS_AN|gnomAD_exomes_SAS_nhomalt|gnomAD_exomes_flag|gnomAD_exomes_nhomalt|gnomAD_exomes_non_ukb_AC|gnomAD_exomes_non_ukb_AF|gnomAD_exomes_non_ukb_AFR_AC|gnomAD_exomes_non_ukb_AFR_AF|gnomAD_exomes_non_ukb_AFR_AN|gnomAD_exomes_non_ukb_AFR_nhomalt|gnomAD_exomes_non_ukb_AMR_AC|gnomAD_exomes_non_ukb_AMR_AF|gnomAD_exomes_non_ukb_AMR_AN|gnomAD_exomes_non_ukb_AMR_nhomalt|gnomAD_exomes_non_ukb_AN|gnomAD_exomes_non_ukb_ASJ_AC|gnomAD_exomes_non_ukb_ASJ_AF|gnomAD_exomes_non_ukb_ASJ_AN|gnomAD_exomes_non_ukb_ASJ_nhomalt|gnomAD_exomes_non_ukb_EAS_AC|gnomAD_exomes_non_ukb_EAS_AF|gnomAD_exomes_non_ukb_EAS_AN|gnomAD_exomes_non_ukb_EAS_nhomalt|gnomAD_exomes_non_ukb_FIN_AC|gnomAD_exomes_non_ukb_FIN_AF|gnomAD_exomes_non_ukb_FIN_AN|gnomAD_exomes_non_ukb_FIN_nhomalt|gnomAD_exomes_non_ukb_MID_AC|gnomAD_exomes_non_ukb_MID_AF|gnomAD_exomes_non_ukb_MID_AN|gnomAD_exomes_non_ukb_MID_nhomalt|gnomAD_exomes_non_ukb_NFE_AC|gnomAD_exomes_non_ukb_NFE_AF|gnomAD_exomes_non_ukb_NFE_AN|gnomAD_exomes_non_ukb_NFE_nhomalt|gnomAD_exomes_non_ukb_SAS_AC|gnomAD_exomes_non_ukb_SAS_AF|gnomAD_exomes_non_ukb_SAS_AN|gnomAD_exomes_non_ukb_SAS_nhomalt|gnomAD_exomes_non_ukb_nhomalt|gnomAD_genomes_AC|gnomAD_genomes_AF|gnomAD_genomes_AFR_AC|gnomAD_genomes_AFR_AF|gnomAD_genomes_AFR_AN|gnomAD_genomes_AFR_nhomalt|gnomAD_genomes_AMI_AC|gnomAD_genomes_AMI_AF|gnomAD_genomes_AMI_AN|gnomAD_genomes_AMI_nhomalt|gnomAD_genomes_AMR_AC|gnomAD_genomes_AMR_AF|gnomAD_genomes_AMR_AN|gnomAD_genomes_AMR_nhomalt|gnomAD_genomes_AN|gnomAD_genomes_ASJ_AC|gnomAD_genomes_ASJ_AF|gnomAD_genomes_ASJ_AN|gnomAD_genomes_ASJ_nhomalt|gnomAD_genomes_EAS_AC|gnomAD_genomes_EAS_AF|gnomAD_genomes_EAS_AN|gnomAD_genomes_EAS_nhomalt|gnomAD_genomes_FIN_AC|gnomAD_genomes_FIN_AF|gnomAD_genomes_FIN_AN|gnomAD_genomes_FIN_nhomalt|gnomAD_genomes_MID_AC|gnomAD_genomes_MID_AF|gnomAD_genomes_MID_AN|gnomAD_genomes_MID_nhomalt|gnomAD_genomes_NFE_AC|gnomAD_genomes_NFE_AF|gnomAD_genomes_NFE_AN|gnomAD_genomes_NFE_nhomalt|gnomAD_genomes_POPMAX_AC|gnomAD_genomes_POPMAX_AF|gnomAD_genomes_POPMAX_AN|gnomAD_genomes_POPMAX_nhomalt|gnomAD_genomes_SAS_AC|gnomAD_genomes_SAS_AF|gnomAD_genomes_SAS_AN|gnomAD_genomes_SAS_nhomalt|gnomAD_genomes_flag|gnomAD_genomes_nhomalt|hg18_chr|hg18_pos(1-based)|hg19_chr|hg19_pos(1-based)|integrated_confidence_value|integrated_fitCons_rankscore|integrated_fitCons_score|phastCons100way_vertebrate|phastCons100way_vertebrate_rankscore|phastCons17way_primate|phastCons17way_primate_rankscore|phastCons470way_mammalian|phastCons470way_mammalian_rankscore|phyloP100way_vertebrate|phyloP100way_vertebrate_rankscore|phyloP17way_primate|phyloP17way_primate_rankscore|phyloP470way_mammalian|phyloP470way_mammalian_rankscore|pos(1-based)|ref|refcodon|rs_dbSNP|MES-NCSS_downstream_acceptor|MES-NCSS_downstream_donor|MES-NCSS_upstream_acceptor|MES-NCSS_upstream_donor|MES-SWA_acceptor_alt|MES-SWA_acceptor_diff|MES-SWA_acceptor_ref|MES-SWA_acceptor_ref_comp|MES-SWA_donor_alt|MES-SWA_donor_diff|MES-SWA_donor_ref|MES-SWA_donor_ref_comp|MaxEntScan_alt|MaxEntScan_diff|MaxEntScan_ref|ada_score|rf_score|SpliceAI_pred_DP_AG|SpliceAI_pred_DP_AL|SpliceAI_pred_DP_DG|SpliceAI_pred_DP_DL|SpliceAI_pred_DS_AG|SpliceAI_pred_DS_AL|SpliceAI_pred_DS_DG|SpliceAI_pred_DS_DL|SpliceAI_pred_SYMBOL|HGMD_2020|HGMD_2020_brief|HGMD_2020_pmid|HGMD_2020_disease|HGMD_2020_hgvs|HGMD_2020_confidence|HGMD_2020_variant_type|clinvar_20250120|clinvar_20250120_ALLELEID|clinvar_20250120_CLNSIG|clinvar_20250120_CLNDN|clinvar_20250120_CLNDNINCL|clinvar_20250120_CLNDISDB|clinvar_20250120_CLNDISDBINCL|clinvar_20250120_CLNHGVS|clinvar_20250120_CLNREVSTAT|clinvar_20250120_CLNSIGCONF|clinvar_20250120_CLNSIGINCL|clinvar_20250120_CLNVC|clinvar_20250120_CLNVCSO|clinvar_20250120_CLNVI|clinvar_20250120_DBVARID|clinvar_20250120_ORIGIN|clinvar_20250120_RS|clinvar_20250120_SSR">"""
# Split the text by "|" and create a list of rows
rows = [line.split("|") for line in data.split("\n")]
# Create a DataFrame from the rows
df = pd.DataFrame(rows)
# Add a second column with row numbers
df["RowNumber"] = df.index + 1
# Transpose the DataFrame using the T attribute
transposed_df = df.T
# Reset the index to move row names into a regular column
df_reset = transposed_df.reset_index()
# Rename the column containing the row names to "name"
df_reset = df_reset.rename(columns={'index': 'index_count'})
# Rename the second column to "VEP"
df_reset = df_reset.rename(columns={df_reset.columns[1]: "VEP"})
# Using + operator to combine two columns into the column "annotations"
df_reset["annotations"] = df_reset['VEP'].astype(str) +" = "+ df_reset["index_count"].astype(str)
# Filter rows based on values in the "VEP" column using the vector of annotations
annotations = ['IMPACT', 'Consequence','SYMBOL','Gene','CANONICAL','MANE_SELECT','MANE_PLUS_CLINICAL','HGVSc','HGVSp','CADD_phred','CADD_raw','CADD_raw_rankscore',
'gnomADe_AF','gnomADe_NFE_AF','gnomADg_AF','gnomADg_NFE_AF',
'SpliceAI_pred_DP_AG','SpliceAI_pred_DP_AL','SpliceAI_pred_DP_DG','SpliceAI_pred_DP_DL','SpliceAI_pred_DS_AG','SpliceAI_pred_DS_AL','SpliceAI_pred_DS_DG','SpliceAI_pred_DS_DL','SpliceAI_pred_SYMBOL','AlphaMissense_pred','AlphaMissense_rankscore','AlphaMissense_score','REVEL_rankscore','REVEL_score','Clinvar_CLNSIG','Clinvar_CLNSIGCONF','clinvar_clnsig','REVEL_rankscore','REVEL_score','Reliability_index','MaxEntScan_alt','MaxEntScan_diff','MaxEntScan_ref']
filtered_df = df_reset[df_reset['VEP'].isin(annotations)]
# Get the values of the 'annotations' column and print without index numbers
for value in filtered_df['annotations']:
print(value)
3.) now I can define variables of interest:
Consequence = 1
IMPACT = 2
SYMBOL = 3
Gene = 4
HGVSc = 10
HGVSp = 11
and then split the CSQ info from VEP:
split_csq = mt.info.CSQ
split_csq = split_csq.map(lambda x: x.split('\\|'))
mt = mt.annotate_rows(split_csq = split_csq)
mt = mt.explode_rows(mt.split_csq)
4.) and finally do the annotate_rows:
mt = mt.annotate_rows(
IMPACT = mt.split_csq[IMPACT],
Consequence = mt.split_csq[Consequence],
SYMBOL = mt.split_csq[SYMBOL],
Gene = mt.split_csq[Gene],
HGVSc = mt.split_csq[HGVSc],
HGVSp = mt.split_csq[HGVSp]
)
Any hint on also improving this part is highly appreciated.