Hello!
I want to use the de_novo function, but I’m not sure I understand the documentation correctly, so I’d like to ask for help.
It’s about the prior parameter: it is the maximum of the computed dataset alternate allele frequency, the pop_frequency_prior parameter, and the global prior 1 / 3e7.
Do I understand correctly that this value, reported back as prior, is what is denoted as AF in the formula for P(m) further down in the documentation?
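To make the rule quoted above concrete, here is a minimal plain-Python sketch of "take the maximum of three frequencies". The function name site_prior is hypothetical, and the global prior is simply the value quoted in this post; this is not Hail's own code.

```python
# Global prior as quoted from the documentation in the post above (assumption).
GLOBAL_PRIOR = 1 / 3e7


def site_prior(dataset_af, pop_frequency_prior, global_prior=GLOBAL_PRIOR):
    """Return the largest of the three candidate frequencies."""
    return max(dataset_af, pop_frequency_prior, global_prior)


# The population prior wins when the in-sample frequency is zero.
print(site_prior(0.0, 1e-4))
# A large in-sample frequency overrides the population prior.
print(site_prior(0.25, 1e-4))
```

With a skewed in-sample frequency, as in the second call, the in-sample value is what ends up being used, which is the concern raised here.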
I have a single family to analyze, without a larger set of samples from the same population, so it doesn’t seem to make sense to rely on the computed dataset alternate allele frequency as the AF. I think it would be seriously skewed because all the samples are related and a large fraction of them are the affected “probands”.
So I think I need to ensure that pop_frequency_prior is used as the AF instead of the computed frequency, as if I had an infinite number of hom-ref samples in my VCF next to the family. Would you agree? Is that possible?
This is a good question. I think your concern is valid, but fortunately the algorithm is designed to handle this case: the computed dataset alternate allele frequency is defined as the AAF without the putative de novo allele in the individual of interest. This means that if you have a single trio, and your genotype configuration is (0/0 - 0/0) => (0/1), then the computed dataset alternate allele frequency is 0%.
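As a rough illustration of that definition, the sketch below computes an alternate allele frequency with one alternate allele (the putative de novo) removed from the count. It is plain Python written from the description above, not the library's implementation; the function name and the choice to keep the proband's alleles in the denominator are assumptions.

```python
def dataset_af_excluding_candidate(alt_allele_counts):
    """AAF over called genotypes, with the putative de novo allele removed.

    alt_allele_counts: per-sample number of alternate alleles (0, 1 or 2),
    or None for a missing genotype call.
    """
    called = [n for n in alt_allele_counts if n is not None]
    if not called:
        return 0.0
    alt_alleles = sum(called)
    total_alleles = 2 * len(called)  # diploid: two alleles per called sample
    # Subtract the one allele carried by the individual of interest.
    return max(alt_alleles - 1, 0) / total_alleles


# Single trio, (0/0 - 0/0) => (0/1): father, mother, child.
trio = [0, 0, 1]
print(dataset_af_excluding_candidate(trio))  # 0.0
```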
Here’s where the code implements these rules:
I see. So if I have more trios, I should probably just loop over them externally, each time filtering the columns down to the three samples and calling (something like) hl.de_novo(mt_trio, hl.Pedigree([trio]), pop_frequency_prior=mt_trio.AF).
Ack, that’s going to be pretty inefficient. Are you worried that you have the same de novo mutation in many families for the same disease, leading to an inflated estimate of in-sample allele frequency due to the ascertainment?
If so, it isn’t actually hard to add a flag to hl.de_novo to just remove the in-sample AF from the calculation.
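To show why ascertainment matters here, the following plain-Python sketch compares the prior with and without the in-sample frequency for three trios that all carry the same de novo allele. The use_in_sample_af argument is a hypothetical stand-in for the kind of flag discussed above, not an actual hl.de_novo parameter, and the global prior is the value quoted earlier in the thread.

```python
GLOBAL_PRIOR = 1 / 3e7  # value quoted earlier in the thread (assumption)


def af_excluding_candidate(alt_counts):
    """In-sample AAF with one putative de novo allele removed."""
    return max(sum(alt_counts) - 1, 0) / (2 * len(alt_counts))


def prior(alt_counts, pop_frequency_prior, use_in_sample_af=True):
    """Maximum of in-sample AF (optional), population prior and global prior."""
    in_sample = af_excluding_candidate(alt_counts) if use_in_sample_af else 0.0
    return max(in_sample, pop_frequency_prior, GLOBAL_PRIOR)


# Three trios (father, mother, child), each child heterozygous at the site.
cohort = [0, 0, 1, 0, 0, 1, 0, 0, 1]

with_in_sample = prior(cohort, 1e-5)                             # (3 - 1) / 18
without_in_sample = prior(cohort, 1e-5, use_in_sample_af=False)  # 1e-5
print(with_in_sample, without_in_sample)
```

In this toy cohort the in-sample frequency is about 0.11, several orders of magnitude above the population prior, so it dominates unless it is dropped.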
Yup.
Assuming I understand correctly: say I have
grandparents -> father,
and (father, mother) -> (child1, child2),
and I’m interested in de novo mutations in father and child1,
and I don’t want the AAF for the father’s de novo call to be computed from this six-person population.
Should I then run the de_novo function separately for father and child1?
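For reference, splitting that pedigree into per-proband sample subsets could look like the plain-Python sketch below. The Trio tuple and the sample names are hypothetical and only mirror the family described above; each resulting subset is what one would filter the columns down to before a separate call.

```python
from collections import namedtuple

# Hypothetical stand-in for a trio record; not Hail's Trio class.
Trio = namedtuple("Trio", ["proband", "father", "mother"])

pedigree = [
    Trio("father", "grandfather", "grandmother"),
    Trio("child1", "father", "mother"),
    Trio("child2", "father", "mother"),
]

probands_of_interest = {"father", "child1"}


def samples_per_trio(trios, probands):
    """Map each proband of interest to the three samples of its trio."""
    return {
        t.proband: [t.father, t.mother, t.proband]
        for t in trios
        if t.proband in probands
    }


subsets = samples_per_trio(pedigree, probands_of_interest)
for proband, samples in subsets.items():
    print(proband, samples)
```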
This pull request adds a mode that essentially treats each trio as the only one in the dataset:
Many thanks.