I want to use the de_novo function, but I’m not sure I understand the documentation correctly, and if I do, I’d like to ask for help.
It’s about the
It is the maximum of: the computed dataset alternate allele frequency, the pop_frequency_prior parameter, and the global prior 1 / 3e7.
Do I understand correctly that this value, reported back as prior, is what is denoted as AF in the formula for P(m) further down in the documentation?
I have a single family to analyze, without a larger set of samples from the same population, so it doesn’t seem to make sense to rely on the computed dataset alternate allele frequency as the AF. I think it would be seriously skewed by the presence of only related samples, a large fraction of whom are the sick “probands”.
So I think I need to ensure that pop_frequency_prior is used as the AF instead of the computed frequency, as if I had an infinite number of hom-ref samples in my VCF next to the family. Would you agree? Is that possible?
This is a good question. I think your concern is valid, but fortunately the algorithm is designed to handle this case: the computed dataset alternate allele frequency is defined as the alternate allele frequency (AAF) without the putative de novo allele in the individual of interest. This means that if you have a single trio, and your genotype configuration is (0/0 - 0/0) => (0/1), then the computed dataset alternate allele frequency is 0%.
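As a rough illustration of the rule described above, here is a minimal plain-Python sketch (not Hail's actual code). The function name and arguments are made up for this example; the only things taken from the thread are the "maximum of three values" rule, the 1 / 3e7 global prior, and the idea that the putative de novo allele is left out of the in-sample frequency.

```python
GLOBAL_PRIOR = 1 / 3e7  # global prior as quoted from the documentation


def site_frequency_prior(n_alt_alleles, n_called_samples, pop_frequency_prior):
    """Illustrative only: max of the in-sample AAF (without the putative
    de novo allele), the population prior, and the global prior."""
    total_alleles = 2 * n_called_samples  # assumes diploid calls
    # Leave out the single allele carried by the putative de novo call.
    in_sample_aaf = max(n_alt_alleles - 1, 0) / total_alleles
    return max(in_sample_aaf, pop_frequency_prior, GLOBAL_PRIOR)


# Single trio, (0/0 - 0/0) => (0/1): one alt allele among three called samples.
prior = site_frequency_prior(n_alt_alleles=1, n_called_samples=3,
                             pop_frequency_prior=1e-4)
print(prior)  # 0.0001, i.e. the population prior wins
```

In this configuration the in-sample term is 0, so the reported prior is whichever is larger of pop_frequency_prior and the global prior.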
Here’s where the code implements these rules:
I see. So, if I have more trios, I should probably just loop over them externally, each time filtering the columns down to the three samples and calling something like hl.de_novo(mt_trio, hl.Pedigree([trio]), pop_frequency_prior=mt_trio.AF).
Ack, that’s going to be pretty inefficient. Are you worried that you have the same de novo mutation in many families for the same disease, leading to an inflated estimate of in-sample allele frequency due to the ascertainment?
If so, it isn’t actually hard to add a flag to hl.de_novo to just remove the in-sample AF from the calculation.
Assuming I understand correctly: say I have
grandparents -> father,
and (father, mother) -> (child1, child2),
and I’m interested in de novos in father and child1, and I don’t want the AAF for the father’s de novo to be computed from this 6-person population. Should I then run the de_novo function separately for father and child1?
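To make the concern concrete, here is a toy plain-Python calculation (not Hail code). The genotypes are hypothetical: a de novo variant in father that is transmitted to both children. The helper is made up for this example and only applies the "leave out one putative de novo allele" rule described earlier in the thread.

```python
def in_sample_aaf(alt_counts, samples):
    """AAF over `samples`, leaving out one putative de novo allele."""
    n_alt = sum(alt_counts[s] for s in samples)
    return max(n_alt - 1, 0) / (2 * len(samples))  # assumes diploid calls


# Hypothetical alt-allele counts at one site (0 = 0/0, 1 = 0/1).
alt_counts = {
    "grandfather": 0, "grandmother": 0,
    "father": 1, "mother": 0,
    "child1": 1, "child2": 1,
}

# All six family members together: (3 - 1) / 12.
whole_family = in_sample_aaf(alt_counts, list(alt_counts))
# Only the trio in which father is the proband: (1 - 1) / 6.
father_trio = in_sample_aaf(alt_counts, ["grandfather", "grandmother", "father"])
print(whole_family, father_trio)
```

With these made-up genotypes, the in-sample frequency over the whole family is about 0.17, while restricting to father's own trio gives 0, which is the difference the question is about.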
This pull request adds a mode that essentially treats each trio as the only one in the dataset: