Prior and pop_frequency_prior in de_novo

I want to use the de_novo function, but I’m not sure I understand the documentation correctly, so I’d like to ask for help.
It’s about the prior field. According to the documentation, it is the maximum of: the computed dataset alternate allele frequency, the pop_frequency_prior parameter, and the global prior 1 / 3e7.
Do I understand correctly that this value, reported back as prior, is what is denoted as AF in the formula for P(m) further down in the documentation?
I have a single family to analyze, without a larger set of samples from the same population, so it doesn’t seem to make sense to rely on the computed dataset alternate allele frequency as the AF. I think it would be seriously skewed because the dataset contains only related samples, a large fraction of whom are the affected “probands”.
So I think I need to ensure that pop_frequency_prior is used as the AF instead of the computed frequency, as if I had an infinite number of hom-ref samples in my VCF next to the family. Would you agree? Is that possible?

This is a good question. I think your concern is valid, but fortunately the algorithm is designed to handle this case – the definition of computed dataset alternate allele frequency is the AAF without the putative de novo allele in the individual of interest. This means that if you have a single trio, and your genotype configuration is (0/0 - 0/0) => (0/1), then the computed dataset alternate allele frequency is 0%.
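
To make that concrete, here is a minimal plain-Python sketch of the rule as I read it from the documentation. It is not Hail's actual code: the name site_prior and its arguments are made up for illustration, and subtracting exactly one alternate allele is my assumption about how "without the putative de novo allele" is applied.

```python
# Global floor on the prior, as quoted from the documentation above.
GLOBAL_PRIOR = 1 / 3e7


def site_prior(n_alt_alleles, n_called_samples, pop_frequency_prior):
    """Illustrative prior: the maximum of the in-sample AAF (minus one
    allele), the population prior, and the global floor."""
    total_alleles = 2 * n_called_samples  # diploid genotype calls
    # Assumption: the putative de novo allele is excluded by removing
    # one alternate allele from the in-sample count.
    in_sample_aaf = max(n_alt_alleles - 1, 0) / total_alleles
    return max(in_sample_aaf, pop_frequency_prior, GLOBAL_PRIOR)


# Single trio, (0/0 - 0/0) => (0/1): one alternate allele among 3 samples.
single_trio = site_prior(n_alt_alleles=1, n_called_samples=3,
                         pop_frequency_prior=1e-4)
print(single_trio)
```

For the single trio the in-sample term is 0, so the result is whichever is larger of pop_frequency_prior and the global floor.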

Here’s where the code implements these rules:

I see. So, if I have more trios, I should probably just loop over them externally, each time filtering the columns down to the three samples of one trio and calling (something like) hl.de_novo(mt_trio, hl.Pedigree([trio]), pop_frequency_prior = mt_trio.AF).

Ack, that’s going to be pretty inefficient. Are you worried that you have the same de novo mutation in many families for the same disease, leading to an inflated estimate of in-sample allele frequency due to the ascertainment?

If so, it isn’t actually hard to add a flag to hl.de_novo that simply removes the in-sample AF from the calculation.

Assuming I understand correctly, suppose I have, say,
grandparents -> father,
and (father, mother) -> (child1, child2),
and I’m interested in de novo mutations in father and child1,
and I don’t want the AAF for father’s de novo to be computed from this six-person population.
Should I then run the de_novo function separately for father and child1?
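
To illustrate what I mean, here is a toy calculation in plain Python (not Hail). The genotypes are made up, the helper in_sample_aaf is hypothetical, and I am assuming the in-sample frequency is computed as (alternate alleles - 1) / (2 * samples).

```python
# Hypothetical genotypes at one site, coded as the number of alternate alleles.
genotypes = {
    "grandfather": 0,
    "grandmother": 0,
    "father": 1,   # putative de novo in father
    "mother": 0,
    "child1": 1,   # here inherited from father (illustrative only)
    "child2": 0,
}


def in_sample_aaf(samples):
    """Alternate allele frequency among `samples`, leaving out one
    alternate allele for the putative de novo (an assumption)."""
    n_alt = sum(genotypes[s] for s in samples)
    total_alleles = 2 * len(samples)
    return max(n_alt - 1, 0) / total_alleles


# All six family members treated as the "population".
whole_family = in_sample_aaf(list(genotypes))
# Only the trio in which father is the proband.
father_trio = in_sample_aaf(["grandfather", "grandmother", "father"])

print(whole_family, father_trio)
```

With all six people the transmitted allele in child1 pushes the in-sample frequency to 1/12, whereas restricting to father's own trio gives 0, which is the behaviour I would like.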

This pull request adds a mode that essentially treats each trio as the only one in the dataset:

Many thanks.