Prior and pop_frequency_prior in de_novo

olszewskip · October 1, 2020, 6:59am

Hello!
I want to use the de_novo function, but I’m not sure I understand the documentation correctly, so I’d like to ask for help.
It’s about the prior field, which the documentation describes as "the maximum of: the computed dataset alternate allele frequency, the pop_frequency_prior parameter, and the global prior 1 / 3e7."
Do I understand correctly that this value, reported back as prior, is what is denoted as AF in the formula for P(m) further down in the documentation?
I have a single family to analyze, without a larger set of samples from the same population, so it doesn’t seem to make sense to rely on the computed dataset alternate allele frequency as the AF. I think it would be seriously skewed because all samples are related and a large fraction of them are the sick “probands”.
So I think I need to ensure that pop_frequency_prior is used as the AF instead of the computed frequency, as if I had an infinite number of hom-ref samples in my VCF next to the family. Would you agree? Is that possible?

tpoterba · October 1, 2020, 3:31pm

This is a good question. I think your concern is valid, but fortunately the algorithm is designed to handle this case – the definition of computed dataset alternate allele frequency is the AAF without the putative de novo allele in the individual of interest. This means that if you have a single trio, and your genotype configuration is (0/0 - 0/0) => (0/1), then the computed dataset alternate allele frequency is 0%.

Here’s where the code implements these rules:

github.com

hail-is/hail/blob/main/hail/python/hail/methods/family_methods.py#L754


required_entry_fields = {'GT', 'AD', 'DP', 'GQ', 'PL'}
missing_fields = required_entry_fields - set(mt.entry)
if missing_fields:
    raise ValueError(f"'de_novo': expected 'MatrixTable' to have at least {required_entry_fields}, "
                     f"missing {missing_fields}")

mt = mt.annotate_rows(__prior=pop_frequency_prior,
                      __alt_alleles=hl.agg.sum(mt.GT.n_alt_alleles()),
                      __total_alleles=2 * hl.agg.sum(hl.is_defined(mt.GT)))
# subtract 1 from __alt_alleles to correct for the observed genotype
mt = mt.annotate_rows(__site_freq=hl.max((mt.__alt_alleles - 1) / mt.__total_alleles, mt.__prior, MIN_POP_PRIOR))
mt = require_biallelic(mt, 'de_novo')

# FIXME check that __site_freq is between 0 and 1 when possible in expr
tm = trio_matrix(mt, pedigree, complete_trios=True)

autosomal = tm.locus.in_autosome_or_par() | (tm.locus.in_x_nonpar() & tm.is_female)
hemi_x = tm.locus.in_x_nonpar() & ~tm.is_female
hemi_y = tm.locus.in_y_nonpar() & ~tm.is_female
hemi_mt = tm.locus.in_mito() & tm.is_female
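
To make the rule concrete, here is a minimal plain-Python sketch of the `__site_freq` line above. It is not Hail code: the function name `site_freq` and its arguments are made up for illustration, and the floor of 1 / 3e7 is the value from the documentation quoted at the top of the thread, so the `MIN_POP_PRIOR` constant in the source may differ.

```python
def site_freq(n_alt_alleles, n_called_genotypes, pop_frequency_prior,
              min_pop_prior=1 / 3e7):
    """Mirror of the max(...) expression for a single site (illustrative only).

    min_pop_prior is an assumed floor taken from the quoted documentation.
    """
    total_alleles = 2 * n_called_genotypes
    # Subtract 1 to leave out the putative de novo allele itself.
    in_sample_af = (n_alt_alleles - 1) / total_alleles
    return max(in_sample_af, pop_frequency_prior, min_pop_prior)


# Single trio, (0/0 - 0/0) => (0/1): one alt allele among three called genotypes.
with_prior = site_freq(n_alt_alleles=1, n_called_genotypes=3,
                       pop_frequency_prior=1e-4)
without_prior = site_freq(n_alt_alleles=1, n_called_genotypes=3,
                          pop_frequency_prior=0.0)
print(with_prior, without_prior)
```

With a single trio the in-sample term is zero, so the result is whichever is larger: pop_frequency_prior or the floor.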

olszewskip · October 1, 2020, 3:40pm

I see. So, in case I have more trios, I should probably just externally loop over those trios, each time filtering the columns down to the three samples and calling (something like) hl.de_novo(mt_trio, hl.Pedigree([trio]), pop_frequency_prior = mt_trio.AF).

tpoterba · October 2, 2020, 3:24pm

Ack, that’s going to be pretty inefficient. Are you worried that you have the same de novo mutation in many families for the same disease, leading to an inflated estimate of in-sample allele frequency due to the ascertainment?

If so, it isn’t actually hard to add a flag to hl.de_novo to just remove the in-sample AF from the calculation.

olszewskip · October 2, 2020, 4:17pm

Yup.
Assuming I understand correctly: say I have
grandparents -> father,
and (father, mother) -> (child1, child2),
and I’m interested in de novos in father and child1,
and I don’t want the AAF for the father’s de novo to be computed from this six-person population.
Should I then run the de_novo function separately for father and child1?
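
As a rough illustration of that concern, the arithmetic can be worked through in plain Python for the six-person pedigree above. This is a sketch, not Hail code: `site_freq` and the `use_in_sample` switch are hypothetical names, the genotypes and the external frequency are invented, and the 1 / 3e7 floor is the value from the documentation quoted earlier in the thread.

```python
def site_freq(genotypes, pop_frequency_prior, use_in_sample=True,
              min_pop_prior=1 / 3e7):
    """genotypes: alt-allele counts (0, 1 or 2) per sample, None if uncalled."""
    called = [g for g in genotypes if g is not None]
    candidates = [pop_frequency_prior, min_pop_prior]
    if use_in_sample:
        # In-sample AF, leaving out the putative de novo allele itself.
        candidates.append((sum(called) - 1) / (2 * len(called)))
    return max(candidates)


# Order: grandfather, grandmother, father, mother, child1, child2.
# Invented scenario: the father carries a de novo allele and child1 inherits it.
pedigree_gts = [0, 0, 1, 0, 1, 0]
father_trio_gts = [0, 0, 1]  # grandparents and father only
external_af = 1e-5           # made-up population frequency

whole_pedigree = site_freq(pedigree_gts, external_af)
trio_only = site_freq(father_trio_gts, external_af)
no_in_sample = site_freq(pedigree_gts, external_af, use_in_sample=False)
print(whole_pedigree, trio_only, no_in_sample)
```

In this invented scenario the transmitted copy in child1 pushes the in-sample term for the father's call to (2 - 1) / 12, far above the external frequency. Restricting to the father's trio, or dropping the in-sample term as suggested earlier in the thread, leaves the external frequency as the prior.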

tpoterba · October 2, 2020, 5:00pm

This pull request adds a mode that essentially treats each trio as the only one in the dataset.

olszewskip · October 7, 2020, 12:21pm

Many thanks.
