Hello!

I want to use the `de_novo` function, but I’m not sure I understand the documentation correctly. If I do, I’d like to ask for help.

It’s about the `prior` parameter: `It is the maximum of: the computed dataset alternate allele frequency, the pop_frequency_prior parameter, and the global prior 1 / 3e7`.
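To make the quoted rule concrete, here is a minimal sketch in plain Python. The function name `de_novo_prior` and its arguments are hypothetical and only mirror the sentence from the documentation; this is not Hail's own code.

```python
# Illustration of the documented rule only, not Hail's implementation.
GLOBAL_PRIOR = 1 / 3e7  # the global prior quoted in the documentation


def de_novo_prior(dataset_af: float, pop_frequency_prior: float) -> float:
    """Return the largest of the three candidate allele frequencies."""
    return max(dataset_af, pop_frequency_prior, GLOBAL_PRIOR)


# The population prior wins when the in-sample frequency is zero.
print(de_novo_prior(dataset_af=0.0, pop_frequency_prior=1e-4))
# A large in-sample frequency overrides a smaller population prior.
print(de_novo_prior(dataset_af=0.25, pop_frequency_prior=1e-4))
# With no other information, the global prior is the floor.
print(de_novo_prior(dataset_af=0.0, pop_frequency_prior=0.0))
```

Because the rule takes a maximum, a supplied `pop_frequency_prior` can only raise the reported `prior`, never lower it below the in-sample frequency.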

Do I understand correctly that this value, reported back as `prior`, is what is denoted as `AF` in the formula for `P(m)` further down in the documentation?

I have a single family to analyze, without a larger set of samples from the same population, so it doesn’t seem to make sense to rely on the computed dataset alternate allele frequency as the `AF`. I think it would be seriously skewed by the presence of only related samples, a large fraction of whom are the sick “probands”.

And so I think I need to ensure that the `pop_frequency_prior` is used as the `AF` instead of the computed frequency, as if I had an infinite number of hom-ref samples in my VCF next to the family. Would you agree? Is that possible?

This is a good question. I think your concern is valid, but fortunately the algorithm is designed to handle this case – the definition of `computed dataset alternate allele frequency` is the AAF *without* the putative de novo allele in the individual of interest. This means that if you have a single trio, and your genotype configuration is (0/0 - 0/0) => (0/1), then the `computed dataset alternate allele frequency` is 0%.
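As a rough illustration of that definition, the sketch below counts alternate alleles across called samples and drops the one putative de novo allele before dividing. The helper name and the choice to leave the denominator unchanged are assumptions made for the example; the actual code may handle the details differently.

```python
# Illustrative sketch; the denominator handling is an assumption.
def in_sample_af_excluding_de_novo(alt_allele_counts):
    """alt_allele_counts: number of alt alleles (0, 1 or 2) per called sample,
    including the child that carries the putative de novo allele."""
    n_alt = sum(alt_allele_counts)
    total_alleles = 2 * len(alt_allele_counts)
    # Remove the single putative de novo allele from the numerator.
    return max(n_alt - 1, 0) / total_alleles


# Single trio, (0/0 - 0/0) => (0/1): the only alt allele is the de novo one.
single_trio = [0, 0, 1]
print(in_sample_af_excluding_de_novo(single_trio))

# Two unrelated trios whose children both carry the same variant:
# one alt allele remains after the exclusion.
two_trios = [0, 0, 1, 0, 0, 1]
print(in_sample_af_excluding_de_novo(two_trios))
```

The second case shows why the exclusion alone does not help when the same variant recurs in several probands: the remaining copies still count toward the in-sample frequency.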

Here’s where the code implements these rules:

I see. So, in case I have more trios, I should probably just loop over those trios externally, each time filtering the columns down to the three samples and calling (something like) `hl.de_novo(mt_trio, hl.Pedigree([trio]), pop_frequency_prior = mt_trio.AF)`.

Ack, that’s going to be pretty inefficient. Are you worried that you have the same de novo mutation in many families for the same disease, leading to an inflated estimate of in-sample allele frequency due to the ascertainment?

If so, it isn’t actually hard to add a flag to `hl.de_novo` to just remove the in-sample AF from the calculation.

Yup.

Assuming I understand correctly: say I have

grandparents -> father,

and (father, mother) -> (child1, child2),

and I’m interested in de novo mutations in father and child1, and I don’t want the AAF for the father’s de novo mutation computed from this six-person population. Should I then run the `de_novo` function separately for father and child1?
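The concern in this scenario can be made concrete with a small sketch. It assumes a hypothetical site where the father carries a de novo allele and has passed it to both children; the sample names, genotypes, and helper function are invented for illustration and do not come from Hail.

```python
# Hypothetical genotypes (alt-allele counts) at one site in the six-person family.
family = {
    "grandfather": 0,
    "grandmother": 0,
    "father": 1,   # putative de novo allele
    "mother": 0,
    "child1": 1,   # inherited from father (assumed)
    "child2": 1,   # inherited from father (assumed)
}


def af_excluding_one_alt(alt_counts):
    """In-sample AF after dropping one putative de novo allele (illustrative)."""
    n_alt = sum(alt_counts.values())
    total_alleles = 2 * len(alt_counts)
    return max(n_alt - 1, 0) / total_alleles


# AF seen when the whole family is treated as the dataset.
whole_family_af = af_excluding_one_alt(family)

# AF seen when only the grandparents -> father trio is considered.
father_trio = {s: family[s] for s in ("grandfather", "grandmother", "father")}
trio_only_af = af_excluding_one_alt(father_trio)

print(whole_family_af)
print(trio_only_af)
```

Under these assumptions the whole-family frequency is 2/12, since the two transmitted copies remain after the exclusion, while the trio-only frequency is 0. That difference is what running each trio on its own would avoid.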

This pull request adds a mode that essentially treats each trio as the only one in the dataset:

Many thanks.