Mendel_errors and split_multi_hts

Hi!
I’m wondering if there are any potential surprises if the mendel_erros function is used after split_multi_hts. In the documentation of the de_novo function there is an explicit statement that The model looks for de novo events in which both parents are homozygous reference and the proband is a heterozygous, which I’m interpreting as an internal prefiltering step, which will ensure that such variants are ignored in the implemented de-novo-calling. Is any such restriction also applied internally to the input of medel_errors? If not, is there then a possibility of mendel errors not being returned for an originally-multiallelic-but-split-erroneous variant?
E.g.
Before splitting there is, if I understand correctly, an error here:
variant = (chr=1, pos=1, alleles=[A, B, C])
mom: GT=0/0
dad: GT=1/2
kid: 1/2,
but after splitting, I’d have no errors?
variant_1 = (chr=1, pos=1, alleles=[A,B])
variant_2 = (chr=1, pos=1, alleles=[A,C])
mom: GT_1=0/0, GT_2=0/0
dad: GT_1=0/1, GT_2=0/1
kid: GT_1=0/1, GT_2=0/1

Is my understanding correct?

You’re spot on – this configuration is a false negative if you run split_multi_hts / mendel_errors in Hail. We’ve intended for a while to fix mendel errors to work for multiallelic variants, but haven’t had a chance yet.

Thank You for the fast response!
Is there room for a similar issue with the de_novo function?
What would You say is the right methodology for using de_novo on trios that belong to a population where there are multiallelic variants, in order to be sure for each multiallelic variant that it was either ignored or analyzed by the de-novo-calling procedure without the possiblity of ignoring a mendelian error?

I think this isn’t really a problem for de_novo, since that method is looking for rare alleles found in the child that aren’t found in either parent. It’s certainly possible that the configuration you posted above is due to a de novo mutation, but hl.de_novo isn’t looking for this class of de novo mutation anyhow (it’s less likely to be involved with disease, I think).

Multiallelics should be properly handled given the ( 0/0 - 0/0 ) -> 0/1 configuration that de novo is looking for.

Sorry for resurrecting an old topic, but I’m still feeling somewhat uneasy about the issue of using de_novo after split_multi_hts. I would actually agree that the following scenarios make sense, in that de_novo was intended for such use, but I wonder what You think:

Zooming in on a single multiallelic variant:

Scenario a)
mother: 0/0
father: 0/2
child: 1/2
Only downcoding the first alt results in (0/0+0/0->0/1) and so de_novo will be called once, and the resulting probablity will have the (reasonably valid) interpretation that the child’s allel 1 is a de-novo mutation rather than a genotyping error. No such determination is made about the allel 2. The fact that the original multialllelic configuration was not (homref, homref, hetref), does not invalidate the de-novo-calling, does it? Notably though, I’m disregarding the constraint that allel 2 must have come from father, and the de-novo could have appeared only in place of maternal allel.

Scenario b)
mother, father: 0/0
child: 1/2
After splitting the variant, de_novo is called on both resulting variants (with identical genotype configurations 0/0 + 0/0 -> 0/1, but possibly different ADs and PLs ), and I could possibly get two corresponding de novo probabilities. Again, the fact that the original multialllelic configuration was not (homref, homref, hetref), does not invalidate the two de-novo-callings. One can start to wonder what is theoretically the right way to aggregate the two result in various scenarios. E.g. I would think that, if the QC is passed in both cases, it makes sense, in first approximation, to return p_1 + p_2 - p_1 * p_2 as the resulting probability of at least one de-novo (whatever that’s worth).

Scenario c)
mother, father: 2/2
child: 0/1
Similarly to scenario a), de-novo will make a determination about child’s allel 1, but the fact that allel 0 is erroneous will be ignored. So, trying to summarize the original variant one could at best claim that the returned de-novo-probablity is a lower bound…

All this musings come from the fact that (sadly) I need to interpret the results of de_novo back in terms of the multiallelic variants. Can one confidently say that all possible scenarios of split_multi_hts followed by de_novo (more or less) make sense and that one should be able to come up with reasonable rules for aggregating the results back into some result that would describe the original multiallelic variant?

(PS Sorry for a long question. I understand that Hail is mainly intended for use in GWAS-like scenarios, with large unrelated populations etc. Any help will be appreciated.)

This is definitely not what we intend! Dealing with the more complicated analyses, like de novo calling, in sequencing data is certainly one of the applications intended from day one. I’ll address your scenarios individually:

Scenario A

This variant will be split into two biallelics, with allele 1 and allele 2 as alternates. The genotypes for allele 1 will be (0/0 0/0 0/1), and the de novo event will be called if the depths/gqs/etc look good. The split variant coming from allele 2 will have the configuration (0/0 0/1 0/1), and so no de novo will be called.

Scenario B

You are correct in your description of the handling of this event. It will be split into two (0/0 0/1 0/1) configurations, each of which is a de novo candidate. You could certainly get two de novo calls here, but I think it’s still correct to treat each allele individually. However, this is a pretty good signal that the variant is bad and should be thrown out in QC.

Scenario C

This will be split into two variants. The variant for allele 1 will have the configuration (0/0 0/0 0/1), and is a de novo candidate. The variant for allele 2 will have the configuration (1/1 1/1 0/0), and will not be considered a de novo candidate because back-mutations to the reference are likely to be calling/QC errors rather than real events (the algorithm is reference-biased).

Conclusion

I think this all mostly works out. It would certainly be possible to roll a multiallelic de novo caller, but it wouldn’t be a direct translation of the de novo caller that Kaitlin Samocha published.

Thank You.

For starters, here’s my attempt at detecting mendelian multiallelic errors. If anyone spots an error or striking inefficiency, please let me know :slight_smile:

def get_allele_indices(call, n_alleles):
    # "0/0" -> {0},
    # "0/2" -> {0, 2}, etc.
    return hl.enumerate(
        call.one_hot_alleles(n_alleles)
    ).filter(
        lambda _: _[1] > 0
    ).map(
        lambda _: _[0]
    )

def compatible_child_calls(n_alleles, pat_call, mat_call):
    # ["0/0", "1/2"] -> {"0/1", "0/2"}, etc.
    pat_allele_indices = get_allele_indices(pat_call, n_alleles)
    mat_allele_indices = get_allele_indices(mat_call, n_alleles)
    return \
    hl.case().when(
        (hl.len(pat_allele_indices) == 1) & (hl.len(mat_allele_indices) == 1),
        hl.set([
            hl.call(pat_allele_indices[0], mat_allele_indices[0])
        ])
    ).when(
        (hl.len(pat_allele_indices) == 1) & (hl.len(mat_allele_indices) == 2),
        hl.set([
            hl.call(pat_allele_indices[0], mat_allele_indices[0]),
            hl.call(pat_allele_indices[0], mat_allele_indices[1])
        ])
    ).when(
        (hl.len(pat_allele_indices) == 2) & (hl.len(mat_allele_indices) == 1),
        hl.set([
            hl.call(pat_allele_indices[0], mat_allele_indices[0]),
            hl.call(pat_allele_indices[1], mat_allele_indices[0])
        ])
    ).when(
        (hl.len(pat_allele_indices) == 2) & (hl.len(mat_allele_indices) == 2),
        hl.set([
            hl.call(pat_allele_indices[0], mat_allele_indices[0]),
            hl.call(pat_allele_indices[0], mat_allele_indices[1]),
            hl.call(pat_allele_indices[1], mat_allele_indices[0]),
            hl.call(pat_allele_indices[1], mat_allele_indices[1])
        ])
    ).or_missing()
    

def mendel_errors_multi(mt, trio):
    
    # Return ht keyed by mt.row_key,
    # with a single boolean field `error`
    # indicating whether there is a mendelian
    # error given the genotypes of the trio.
    
    assert(
        mt.row_key.dtype.fields == ('locus', 'alleles')
    )
    assert(
        mt.aggregate_cols(
            hl.set(
                [trio.s, trio.pat_id, trio.mat_id]
            ).is_subset(hl.agg.collect_as_set(mt.s))
        )
    )
    
    mendel_ht = mt.rows().select().select_globals(
        s = trio.s,
        pat_id = trio.pat_id,
        mat_id = trio.mat_id
    )
    s_GT = mt[mendel_ht.key, mendel_ht.s].GT
    pat_GT = mt[mendel_ht.key, mendel_ht.pat_id].GT
    mat_GT = mt[mendel_ht.key, mendel_ht.mat_id].GT

    mendel_ht = mendel_ht.annotate(
        error = hl.case().when(
            hl.is_defined(s_GT),
            ~(compatible_child_calls(
                hl.len(mendel_ht.alleles),
                pat_GT,
                mat_GT
            ).map(
                lambda call: call.unphased_diploid_gt_index()
            ).contains(
                s_GT.unphased_diploid_gt_index()
            ))
        ).or_missing()
    )
    
    return mendel_ht