Hi,
I would like to build a data structure that attaches CADD scores (or any other score) to every genomic position within an interval. This means each position would carry not one score but three, one for each possible nucleotide substitution at that position.
So far I can pull the reference sequence for each interval:
bed.locus_start.sequence_context(after=bed.after).show()
interval
interval<locus<GRCh37>> str
[1:22148737-1:22263803] "TGGCGGGGATAGCACCGTTTATTAAGAAAAATCAAGACAAAGACCACAGGAGGGTCCCTTCTAGGACACAGA...
[1:94458390-1:94586704] "TAGTTTGTGATGAGTGCATTTGCATTTTATTTTTCCATGAAAATCACACACAACGCAGACACACAGACAAAC...
[1:97543299-1:98460556] "TTTAAAATGCTTTATGATATTTTATTTGATATTATTCAGTAATACAGGTTTTGTGGCAAATATGCATTTCTA...
What would be the best way to annotate / represent this data?
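To make the target concrete, here is a minimal sketch in plain Python (not Hail) of the long format I have in mind: one row per possible substitution, built from an interval's start position and its reference sequence. The names `Substitution` and `expand_interval` are made up for illustration, and skipping non-ACGT bases such as N is my own assumption.

```python
from typing import Iterator, NamedTuple

BASES = "ACGT"


class Substitution(NamedTuple):
    contig: str
    position: int
    ref: str
    alt: str


def expand_interval(contig: str, start: int, sequence: str) -> Iterator[Substitution]:
    """Yield one row per possible single-nucleotide substitution."""
    for offset, ref in enumerate(sequence.upper()):
        if ref not in BASES:
            continue  # assumption: skip N and other ambiguity codes
        for alt in BASES:
            if alt != ref:
                yield Substitution(contig, start + offset, ref, alt)


# First two bases of the first interval shown above (1:22148737, "TG...")
rows = list(expand_interval("1", 22148737, "TG"))
for row in rows:
    print(row)
```

Each position expands to three rows, so an interval of length n gives at most 3n rows that could then be matched against a score table on (locus, ref, alt).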
My CADD table looks like this:
cadd.show()
locus ref alt raw_score phred_score
locus<GRCh37> str str float64 float64
1:22148737 "T" "A" 1.45e+00 1.56e+01
1:22148737 "T" "C" 1.47e+00 1.57e+01
1:22148737 "T" "G" 1.45e+00 1.56e+01
1:22148738 "G" "A" 1.30e+00 1.49e+01
1:22148738 "G" "C" 1.25e+00 1.47e+01
1:22148738 "G" "T" 1.26e+00 1.47e+0
In theory I could just use the “ref” field of the CADD table as my reference baseline, but then I would miss any positions that CADD does not cover.
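What I am after is closer to a left join: build the full set of (locus, ref, alt) keys from the reference sequence and look each one up in CADD, so that uncovered positions show up as missing instead of disappearing. Below is a toy sketch in plain Python, not Hail code. The function `annotate` and the dict layout are hypothetical, the scores are copied from the first rows of `cadd.show()`, and the G>T row at 1:22148738 is left out on purpose to show a missing entry.

```python
from typing import Dict, Optional, Tuple

BASES = "ACGT"
Key = Tuple[str, int, str, str]  # (contig, position, ref, alt)

# Toy lookup: phred scores copied from the first rows of cadd.show()
cadd_phred: Dict[Key, float] = {
    ("1", 22148737, "T", "A"): 15.6,
    ("1", 22148737, "T", "C"): 15.7,
    ("1", 22148737, "T", "G"): 15.6,
    ("1", 22148738, "G", "A"): 14.9,
    ("1", 22148738, "G", "C"): 14.7,
}


def annotate(
    contig: str, start: int, sequence: str, scores: Dict[Key, float]
) -> Dict[int, Dict[str, Optional[float]]]:
    """Return {position: {alt: score or None}} for every base in the sequence."""
    result: Dict[int, Dict[str, Optional[float]]] = {}
    for offset, ref in enumerate(sequence):
        position = start + offset
        # The reference sequence drives the keys, so every position is kept;
        # substitutions absent from the score table map to None.
        result[position] = {
            alt: scores.get((contig, position, ref, alt))
            for alt in BASES
            if alt != ref
        }
    return result


# First three bases of the first interval shown above (1:22148737, "TGG...")
annotated = annotate("1", 22148737, "TGG", cadd_phred)
for position, by_alt in annotated.items():
    print(position, by_alt)
```

Here the third position has no rows in the toy table, so it is kept with three missing scores, which is the behaviour I would lose by starting from the CADD table itself.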
I’m running Hail on my local cluster, so I don’t think I can use annotation_db, which is only available on Google Dataproc.
Any help?
Best,
Pedro