Hi everyone, I apologize if this question falls outside the purview of this forum.
I was going through the gnomAD constraint calculation code (https://github.com/broadinstitute/gnomad_lof/blob/master/constraint_utils/constraint_basics.py line 318) and had some questions regarding the calculation of the mutation probabilities.
I do not know the shape of the data inside genome_ht and context_ht, as I have never used Hail.
My question is: how are the probabilities of a base mutation calculated? I find the supplementary documentation vague on how this is done.
I have narrowed the calculation down to three candidates. Taking the AAA>ATA mutation as an example, define:
A = number of AAA>ATA mutations in the whole genome.
B = total number of mutations in the AAA context in the whole genome, i.e. AAA>ATA + AAA>ACA + AAA>AGA.
C = number of times AAA occurs in the whole genome, counting overlapping occurrences. For example, in the sequence AAAAA, AAA occurs 3 times.
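To be explicit about what I mean by C, here is a minimal Python sketch of the overlapping count. The function name `count_overlapping` is my own and is not taken from the gnomAD code; I am only assuming that every position in the genome contributes one trinucleotide context.

```python
def count_overlapping(sequence: str, context: str) -> int:
    """Count occurrences of `context` in `sequence`, allowing overlaps."""
    k = len(context)
    # Slide a window of length k over the sequence, one base at a time.
    return sum(
        1
        for i in range(len(sequence) - k + 1)
        if sequence[i:i + k] == context
    )


# Windows starting at positions 0, 1 and 2 all match.
print(count_overlapping("AAAAA", "AAA"))  # 3
```

Note that Python's built-in `"AAAAA".count("AAA")` returns 1, because it counts non-overlapping matches only, so it would not give the C defined above.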
Which of the following is the correct equation for the probability of an AAA>ATA mutation?
- A/B
- A/C
- A/(C*3)
- Or am I completely wrong?
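To make the difference between the three candidates concrete, here is a small Python sketch with made-up counts. The numbers and the helper `candidate_ratios` are purely hypothetical (they are not gnomAD data or gnomAD code), and the sketch does not claim to show which ratio gnomAD actually uses.

```python
from fractions import Fraction


def candidate_ratios(a: int, b: int, c: int) -> dict:
    """Return the three candidate ratios for the counts A, B and C."""
    return {
        "A/B": Fraction(a, b),          # share of AAA mutations that are AAA>ATA
        "A/C": Fraction(a, c),          # AAA>ATA mutations per AAA site
        "A/(C*3)": Fraction(a, c * 3),  # per AAA site and per possible alternate base
    }


# Hypothetical counts, invented for illustration only.
observed = {"AAA>ATA": 20, "AAA>ACA": 30, "AAA>AGA": 50}
A = observed["AAA>ATA"]
B = sum(observed.values())
C = 1000  # number of AAA sites, counting overlaps

ratios = candidate_ratios(A, B, C)
for name, value in ratios.items():
    print(f"{name} = {float(value):.5f}")
```

With these invented counts, A/B sums to 1 across the three AAA mutations, while A/C and A/(C*3) are per-site quantities that depend on how often the context occurs. That difference is the core of my question.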
Also, what is the logic behind the correction factor used to calculate mu from these probabilities?
Thanks a lot