De novo calls on hemizygous X variants

Hi! I’m using hail’s de novo caller (Hail | Genetics) and I was wondering how it handles hemizygous variants on the X chromosome in males? I’ve used the caller to successfully call DNVs on the X chromosome in females and autosomal DNVs, but it doesn’t seem to make hemizygous X chromosome calls in males.


hl.de_novo should support this, but was written to directly translate Kaitlin Samocha’s de novo caller several years ago, and expects that calls on sex chromosomes are (incorrectly) encoded as diploid. Do you have males encoded correctly as hemizygous haploid calls?

If so, we can write a preprocessing annotate_entries to transform this into something compatible with the de_novo algorithm.

Thanks! The males GT are straight from a GATK calling pipeline and the X chromosome GT are incorrectly encoded as diploid for male samples. What I was hoping for hail to detect is the scenario on the X Chromosome where parents GT are both 0/0 and the male proband is 1/1.

Hmm, I expected it to be working in this case – let me investigate.

In calling de novos on ASC data, we do get de novos on the X in male probands, but we really don’t get very many (only like 1 per ~750 male probands). One issue is that sperm are more mutable than oocytes, and a male proband is only going to get an X from the oocyte. But one of the things on my to-do list has been to take another look at the algorithm and the filtering I do upstream and see if there’s anything I do that has hamstrung the process (like filtering out het calls in males in hemizygous regions). Tim, I’d be interested to know if you do look at this and anything stands out to you.

It looks like the method is just broken for this configuration. I think if you recode 1/1 male calls to 0/1 in hemizygous regions manually, you will get calls.

Thanks Tim. I’m taking a closer look at our calls on the X, and I’m noticing that the parents are always 0/0 and the male proband call is always 0/1. Maybe we’re only getting the ones where Hail think it’s pseudoautosomal and so my pipeline didn’t filter the male het call?

In checking the de novo calls we had with the previous dataset on Hail 0.1, I do see several calls with 0/0 parents and a 1/1 male proband. Is there something that could have changed between 0.1 and 0.2?

I don’t think this changed between 0.1 and 0.2, especially with the pretty big randomly generated test, but it’s possible. The algorithm right now doesn’t produce calls (ever) with an 0/0 0/0 1/1 configuration on hemi X males, and I wonder if Kaitlin’s original caller did this or not. Regardless, we should change it (even producing a divergence between the Hail caller and Kaitlin’s) so that we do generate calls in this case.

OK. Hmm… I’ll have to look at my 0.1 code and the caller code and see if I had done anything to handle that. It’s certainly something I need to handle going forward. And to confirm, I looked at the positions of the few male X de novos that I do have, and they are all pseudoautosomal, so the 0/1 in the proband is legit.

Hey Tim and Kyle, long time no chat. Hope all is well. I’ve been lurking on this thread as Nick is a grad student in my lab.

It would be great to make the changes for male, as it appears to be only looking for 0/0 for parents and 0/1 for child scenarios. It works for autosomes and even female probands on chrX but it breaks for males where 1/1 is observed instead.

We are updating the pipeline used for analysis in the Pediatric Cardiac Genomics Consortium (PCGC) to use hail as its awesome and cool though debatable if the developers are cool.

TL;DR version: cool can you please update de novo calling, happy to test and feedback!

Hope you’re well, Monkol!

Kyle and I were just talking out-of-band about this. The short summary is that the hemizygous calling isn’t working as expected, though it’s still an open question whether these semantics match Kaitlin’s original caller (which I tried to replicate exactly in Hail). It’s not hard to fix this; I should be able to do that tomorrow!


Hi Monkol! Nice to hear from you. One of the questions that Tim raised when we were talking is whether the ability to call de novos from 0/0 parents and a 1/1 kid is something we would want in general rather than just in hemizygous regions (in case the kid is actually het but happens to have an allele balance that looks homvar). I think I come down on the side of “yes”. I usually use a depth minimum of 10 for autosomal calls, so I don’t think it would get me many new ones, but there might be others who are working with lower thresholds. And if it uncovers a new kind of error mode it would be easy to filter out of the list of calls. Do you have thoughts on this?

Hi all – as an update, I talked about the de novo caller code with Dave Cutler (genetics faculty at Emory and a PI in our consortium), and I’m currently trying it this way (which also incorporates some of Tim’s edits from his pull request):

import hail as hl
from hail.genetics.pedigree import Pedigree
from hail.matrixtable import MatrixTable
from hail.expr import expr_float64
from hail.table import Table
from hail.typecheck import typecheck, numeric
from hail.methods.misc import require_biallelic

def my_de_novo(mt: MatrixTable,
            pedigree: Pedigree,
            min_gq: int = 20,
            min_p: float = 0.05,
            max_parent_ab: float = 0.05,
            min_child_ab: float = 0.20,
            min_dp_ratio: float = 0.10,
            ignore_in_sample_allele_frequency: bool = False) -> Table:
    DE_NOVO_PRIOR = 1 / 30000000
    MIN_POP_PRIOR = 100 / 30000000

    required_entry_fields = {'GT', 'AD', 'DP', 'GQ', 'PL'}
    missing_fields = required_entry_fields - set(mt.entry)
    if missing_fields:
        raise ValueError(f"'de_novo': expected 'MatrixTable' to have at least {required_entry_fields}, "
                         f"missing {missing_fields}")

    pop_frequency_prior = \
        .when((pop_frequency_prior >= 0) & (pop_frequency_prior <= 1), pop_frequency_prior) \
        .or_error(hl.str("de_novo: expect 0 <= pop_frequency_prior <= 1, found " + hl.str(pop_frequency_prior)))

    if ignore_in_sample_allele_frequency:
        # this mode is used when families larger than a single trio are observed, in which
        # an allele might be de novo in a parent and transmitted to a child in the dataset.
        # The original model does not handle this case correctly, and so this experimental
        # mode can be used to treat each trio as if it were the only one in the dataset.
        mt = mt.annotate_rows(__prior=pop_frequency_prior,
                              __site_freq=hl.max(pop_frequency_prior, MIN_POP_PRIOR))
        n_alt_alleles = hl.agg.sum(mt.GT.n_alt_alleles())
        total_alleles = 2 * hl.agg.sum(hl.is_defined(mt.GT))
        # subtract 1 from __alt_alleles to correct for the observed genotype
        mt = mt.annotate_rows(__prior=pop_frequency_prior,
                              __site_freq=hl.max((n_alt_alleles - 1) / total_alleles,

    mt = require_biallelic(mt, 'de_novo')

    tm = hl.trio_matrix(mt, pedigree, complete_trios=True)
    tm = tm.annotate_rows(,
    autosomal = tm.__autosomal | (tm.__x_nonpar & tm.is_female)
    hemi_x = tm.__x_nonpar & ~tm.is_female
    hemi_y = tm.__y_nonpar & ~tm.is_female
    hemi_mt = tm.__mito

    is_snp = hl.is_snp(tm.alleles[0], tm.alleles[1])
    n_alt_alleles = tm.__alt_alleles
    prior = tm.__site_freq
    # Updated for hemizygous child calls to not require a call in uninvolved parent
    has_candidate_gt_configuration = (
        ( autosomal & tm.proband_entry.GT.is_het() & 
          tm.father_entry.GT.is_hom_ref() & tm.mother_entry.GT.is_hom_ref() ) |
        ( hemi_x & tm.proband_entry.GT.is_hom_var() & tm.mother_entry.GT.is_hom_ref() ) |
        ( hemi_y & tm.proband_entry.GT.is_hom_var() & tm.father_entry.GT.is_hom_ref() ) |
        ( hemi_mt & tm.proband_entry.GT.is_hom_var() & tm.mother_entry.GT.is_hom_ref() ) )
    failure = hl.null(hl.tstruct(p_de_novo=hl.tfloat64, confidence=hl.tstr))
    kid = tm.proband_entry
    dad = tm.father_entry
    mom = tm.mother_entry

    kid_linear_pl = 10 ** (-kid.PL / 10)
    kid_pp = hl.bind(lambda x: x / hl.sum(x), kid_linear_pl)

    dad_linear_pl = 10 ** (-dad.PL / 10)
    dad_pp = hl.bind(lambda x: x / hl.sum(x), dad_linear_pl)

    mom_linear_pl = 10 ** (-mom.PL / 10)
    mom_pp = hl.bind(lambda x: x / hl.sum(x), mom_linear_pl)

    kid_ad_ratio = kid.AD[1] / hl.sum(kid.AD)
    # Try to get these all to an expected value of 0.5
    dp_ratio = (
                  .when(hemi_x, kid.DP / mom.DP) # Because mom is diploid but kid is not
                  .when(hemi_y, kid.DP / (2*dad.DP))
                  .when(hemi_mt, kid.DP / (2*mom.DP))
                  .when(tm.__x_nonpar & tm.is_female, (kid.DP / (mom.DP + dad.DP)) * (3/4)) # Because of hemi dad
                  .default( kid.DP / (mom.DP + dad.DP) ))

    def solve(p_de_novo, parent_tot_read_check, parent_alt_read_check):
        return (
            .when(kid.GQ < min_gq, failure)
            .when((dp_ratio < min_dp_ratio)
                  | (kid_ad_ratio < min_child_ab), failure)
            .when(~parent_tot_read_check, failure)
            .when(~parent_alt_read_check, failure)
            .when(p_de_novo < min_p, failure)
                  .when((p_de_novo > 0.99) & (kid_ad_ratio > 0.3) & (n_alt_alleles == 1),
                        hl.struct(p_de_novo=p_de_novo, confidence='HIGH'))
                  .when((p_de_novo > 0.5) & (kid_ad_ratio > 0.3) & (n_alt_alleles <= 5),
                        hl.struct(p_de_novo=p_de_novo, confidence='MEDIUM'))
                  .when(kid_ad_ratio > 0.2,
                        hl.struct(p_de_novo=p_de_novo, confidence='LOW'))
                     .when(((p_de_novo > 0.99) & (kid_ad_ratio > 0.3) & (dp_ratio > 0.2))
                           | ((p_de_novo > 0.99) & (kid_ad_ratio > 0.3) & (n_alt_alleles == 1))
                           | ((p_de_novo > 0.5) & (kid_ad_ratio > 0.3) & (n_alt_alleles < 10) & (kid.DP > 10)),
                           hl.struct(p_de_novo=p_de_novo, confidence='HIGH'))
                     .when((p_de_novo > 0.5) & ((kid_ad_ratio > 0.3) | (n_alt_alleles == 1)),
                           hl.struct(p_de_novo=p_de_novo, confidence='MEDIUM'))
                     .when(kid_ad_ratio > 0.2,
                           hl.struct(p_de_novo=p_de_novo, confidence='LOW'))
    # Call autosomal or pseudoautosomal de novo variants
    def call_auto(kid_pp, dad_pp, mom_pp):
        p_data_given_dn = dad_pp[0] * mom_pp[0] * kid_pp[1] * DE_NOVO_PRIOR
        p_het_in_parent = 1 - (1 - prior) ** 4
        p_data_given_missed_het = (dad_pp[1]*mom_pp[0] + dad_pp[0]*mom_pp[1]) * kid_pp[1] * p_het_in_parent
        # Note: a full treatment would include possibility kid genotype is in error:
        # "+ ( (dad_pp[0]*mom_pp[0] + dad_pp[0]*mom_pp[0]) * kid_pp[0] * (1 - p_het_in_parent) )",
        # which is approximately (dad_pp[0]*mom_pp[0] + dad_pp[0]*mom_pp[0]) * kid_pp[0]
        p_de_novo = p_data_given_dn / (p_data_given_dn + p_data_given_missed_het)

        parent_tot_read_check = (hl.sum(mom.AD) > 0) & (hl.sum(dad.AD) > 0)
        parent_alt_read_check = ( ((mom.AD[1] / hl.sum(mom.AD)) <= max_parent_ab) &
                                  ((dad.AD[1] / hl.sum(dad.AD)) <= max_parent_ab) )
        return hl.bind(solve, p_de_novo, parent_tot_read_check, parent_alt_read_check)

    # Call hemizygous de novo variants on the X
    def call_hemi_x(kid_pp, parent, parent_pp):
        p_data_given_dn = parent_pp[0] * kid_pp[2] * DE_NOVO_PRIOR
        p_het_in_parent = 1 - (1 - prior) ** 2
        p_data_given_missed_het = ((parent_pp[1] + parent_pp[2]) * kid_pp[2] * 
                                   p_het_in_parent) + (parent_pp[0] * kid_pp[0])
        # Note: if simplified to be more in line with the autosomal calls, this would be:
        # parent_pp[1] * kid_pp[2] * p_het_in_parent
        p_de_novo = p_data_given_dn / (p_data_given_dn + p_data_given_missed_het)

        parent_tot_read_check = (hl.sum(parent.AD) > 0)
        parent_alt_read_check = (parent.AD[1] / hl.sum(parent.AD)) <= max_parent_ab      

        return hl.bind(solve, p_de_novo, parent_tot_read_check, parent_alt_read_check)

    # Call hemizygous de novo variants on the Y and mitochondrial chromosome (ignoring mito heteroplasmy)
    def call_hemi_y_mt(kid_pp, parent, parent_pp):
        p_data_given_dn = parent_pp[0] * kid_pp[2] * DE_NOVO_PRIOR
        p_het_in_parent = 1 - (1 - prior) ** 1
        p_data_given_missed_het = (parent_pp[2] * kid_pp[2] * 
                                   p_het_in_parent) + (parent_pp[0] * kid_pp[0])
        # Note: if simplified to be more in line with the autosomal calls, this would be:
        # parent_pp[2] * kid_pp[2] * p_het_in_parent
        p_de_novo = p_data_given_dn / (p_data_given_dn + p_data_given_missed_het)

        parent_tot_read_check = (hl.sum(parent.AD) > 0)
        parent_alt_read_check = (parent.AD[1] / hl.sum(parent.AD)) <= max_parent_ab      

        return hl.bind(solve, p_de_novo, parent_tot_read_check, parent_alt_read_check)       
    # Main routine
    de_novo_call = (
        .when(~has_candidate_gt_configuration, failure)
        .when(autosomal, hl.bind(call_auto, kid_pp, dad_pp, mom_pp))
        .when(hemi_x, hl.bind(call_hemi_x, kid_pp, mom, mom_pp))
        .when(hemi_y, hl.bind(call_hemi_y_mt, kid_pp, dad, dad_pp))
        .when(hemi_mt, hl.bind(call_hemi_y_mt, kid_pp, mom, mom_pp))

    tm = tm.annotate_entries(__call=de_novo_call)
    tm = tm.filter_entries(hl.is_defined(tm.__call))
    entries = tm.entries()
    return ('__site_freq',
            .rename({'__site_freq': 'prior'}))