Sorry for resurrecting an old topic, but I’m still feeling somewhat uneasy about using de_novo after split_multi_hts. I would actually agree that the following scenarios make sense, in that de_novo was intended for such use, but I wonder what you think:
Zooming in on a single multiallelic variant:
Scenario a)
mother: 0/0
father: 0/2
child: 1/2
Downcoding only the first alt results in (0/0 + 0/0 -> 0/1), so de_novo will be called once, and the resulting probability will have the (reasonably valid) interpretation that the child’s allele 1 is a de-novo mutation rather than a genotyping error. No such determination is made about allele 2. The fact that the original multiallelic configuration was not (homref, homref, hetref) does not invalidate the de-novo call, does it? Notably, though, I’m disregarding the constraint that allele 2 must have come from the father, so that the de-novo mutation could have appeared only in place of the maternal allele.
Scenario b)
mother, father: 0/0
child: 1/2
After splitting the variant, de_novo is called on both resulting variants (with identical genotype configurations 0/0 + 0/0 -> 0/1, but possibly different ADs and PLs), and I could get two corresponding de-novo probabilities. Again, the fact that the original multiallelic configuration was not (homref, homref, hetref) does not invalidate the two de-novo calls. One can start to wonder what is theoretically the right way to aggregate the two results in various scenarios. E.g. I would think that, if QC is passed in both cases, it makes sense, as a first approximation that treats the two calls as independent, to return p_1 + p_2 - p_1 * p_2 as the resulting probability of at least one de-novo event (whatever that’s worth).
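To make the aggregation in scenario b) concrete, here is a minimal sketch in plain Python. It assumes the per-allele de-novo probabilities are independent, which is only a first approximation, and the function name prob_at_least_one is made up for illustration; it is not part of Hail.

```python
def prob_at_least_one(probs):
    """Probability that at least one event occurs, ASSUMING independence.

    For two values this equals p_1 + p_2 - p_1 * p_2.
    """
    prob_none = 1.0
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
        # Multiply up the chance that none of the events happened.
        prob_none *= 1.0 - p
    return 1.0 - prob_none


# Hypothetical per-allele probabilities for the two split variants.
p_1, p_2 = 0.9, 0.5
combined = prob_at_least_one([p_1, p_2])
print(combined)
```

Writing it as 1 minus the product of complements gives the same value as p_1 + p_2 - p_1 * p_2 for two alleles and extends to sites with more than two alts.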
Scenario c)
mother, father: 2/2
child: 0/1
Similarly to scenario a), de_novo will make a determination about the child’s allele 1, but the fact that allele 0 is erroneous will be ignored. So, trying to summarize the original variant, one could at best claim that the returned de-novo probability is a lower bound…
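The three scenarios can be checked mechanically with a small plain-Python sketch. It does not call Hail; it only mimics the downcoding described above (the chosen alt becomes allele 1, every other allele becomes 0) on unphased genotypes, and it takes the (homref, homref, het) configuration mentioned above as the criterion for a de-novo candidate. The helper names downcode, split_trio and is_de_novo_candidate are made up for illustration.

```python
def downcode(genotype, alt_index):
    """Keep alt_index as allele 1 and collapse every other allele to 0."""
    return tuple(sorted(1 if allele == alt_index else 0 for allele in genotype))


def split_trio(trio, n_alts):
    """Return {alt_index: downcoded trio} for each alt allele of the site."""
    return {
        alt: {member: downcode(gt, alt) for member, gt in trio.items()}
        for alt in range(1, n_alts + 1)
    }


def is_de_novo_candidate(split):
    """Assumed candidate configuration: both parents 0/0, child 0/1."""
    return (
        split["mother"] == (0, 0)
        and split["father"] == (0, 0)
        and split["child"] == (0, 1)
    )


scenarios = {
    "a": {"mother": (0, 0), "father": (0, 2), "child": (1, 2)},
    "b": {"mother": (0, 0), "father": (0, 0), "child": (1, 2)},
    "c": {"mother": (2, 2), "father": (2, 2), "child": (0, 1)},
}

for name, trio in scenarios.items():
    for alt, split in split_trio(trio, n_alts=2).items():
        print(name, alt, split, is_de_novo_candidate(split))
```

Under these assumptions only alt 1 is a candidate in scenarios a) and c), while both alts are candidates in scenario b), which matches the descriptions above.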
All these musings come from the fact that (sadly) I need to interpret the results of de_novo back in terms of the multiallelic variants. Can one confidently say that all possible scenarios of split_multi_hts followed by de_novo (more or less) make sense, and that one should be able to come up with reasonable rules for aggregating the results back into a result that describes the original multiallelic variant?
(PS Sorry for the long question. I understand that Hail is mainly intended for use in GWAS-like scenarios, with large unrelated populations etc. Any help will be appreciated.)