De novo calls on hemizygous X variants

Nick · November 17, 2021, 11:26pm

Hi! I’m using hail’s de novo caller (Hail | Genetics) and I was wondering how it handles hemizygous variants on the X chromosome in males? I’ve used the caller to successfully call DNVs on the X chromosome in females and autosomal DNVs, but it doesn’t seem to make hemizygous X chromosome calls in males.

Thanks!

tpoterba · November 18, 2021, 2:24pm

hl.de_novo should support this, but was written to directly translate Kaitlin Samocha’s de novo caller several years ago, and expects that calls on sex chromosomes are (incorrectly) encoded as diploid. Do you have males encoded correctly as hemizygous haploid calls?

tpoterba · November 18, 2021, 2:25pm

If so, we can write a preprocessing annotate_entries to transform this into something compatible with the de_novo algorithm.

Nick · November 18, 2021, 5:21pm

Thanks! The males GT are straight from a GATK calling pipeline and the X chromosome GT are incorrectly encoded as diploid for male samples. What I was hoping for hail to detect is the scenario on the X Chromosome where parents GT are both 0/0 and the male proband is 1/1.

tpoterba · November 18, 2021, 5:22pm

Hmm, I expected it to be working in this case – let me investigate.

kyle · November 23, 2021, 8:55pm

In calling de novos on ASC data, we do get de novos on the X in male probands, but we really don’t get very many (only like 1 per ~750 male probands). One issue is that sperm are more mutable than oocytes, and a male proband is only going to get an X from the oocyte. But one of the things on my to-do list has been to take another look at the algorithm and the filtering I do upstream and see if there’s anything I do that has hamstrung the process (like filtering out het calls in males in hemizygous regions). Tim, I’d be interested to know if you do look at this and anything stands out to you.

tpoterba · November 23, 2021, 9:04pm

It looks like the method is just broken for this configuration. I think if you recode 1/1 male calls to 0/1 in hemizygous regions manually, you will get calls.

kyle · November 23, 2021, 9:31pm

Thanks Tim. I’m taking a closer look at our calls on the X, and I’m noticing that the parents are always 0/0 and the male proband call is always 0/1. Maybe we’re only getting the ones where Hail think it’s pseudoautosomal and so my pipeline didn’t filter the male het call?

In checking the de novo calls we had with the previous dataset on Hail 0.1, I do see several calls with 0/0 parents and a 1/1 male proband. Is there something that could have changed between 0.1 and 0.2?

tpoterba · November 23, 2021, 9:35pm

I don’t think this changed between 0.1 and 0.2, especially with the pretty big randomly generated test, but it’s possible. The algorithm right now doesn’t produce calls (ever) with an 0/0 0/0 1/1 configuration on hemi X males, and I wonder if Kaitlin’s original caller did this or not. Regardless, we should change it (even producing a divergence between the Hail caller and Kaitlin’s) so that we do generate calls in this case.

kyle · November 23, 2021, 9:53pm

OK. Hmm… I’ll have to look at my 0.1 code and the caller code and see if I had done anything to handle that. It’s certainly something I need to handle going forward. And to confirm, I looked at the positions of the few male X de novos that I do have, and they are all pseudoautosomal, so the 0/1 in the proband is legit.

Monkol_Lek · November 23, 2021, 10:38pm

Hey Tim and Kyle, long time no chat. Hope all is well. I’ve been lurking on this thread as Nick is a grad student in my lab.

It would be great to make the changes for male, as it appears to be only looking for 0/0 for parents and 0/1 for child scenarios. It works for autosomes and even female probands on chrX but it breaks for males where 1/1 is observed instead.

We are updating the pipeline used for analysis in the Pediatric Cardiac Genomics Consortium (PCGC) to use hail as its awesome and cool though debatable if the developers are cool.

TL;DR version: cool can you please update de novo calling, happy to test and feedback!

tpoterba · November 23, 2021, 10:44pm

Hope you’re well, Monkol!

Kyle and I were just talking out-of-band about this. The short summary is that the hemizygous calling isn’t working as expected, though it’s still an open question whether these semantics match Kaitlin’s original caller (which I tried to replicate exactly in Hail). It’s not hard to fix this; I should be able to do that tomorrow!

kyle · November 24, 2021, 5:43pm

Hi Monkol! Nice to hear from you. One of the questions that Tim raised when we were talking is whether the ability to call de novos from 0/0 parents and a 1/1 kid is something we would want in general rather than just in hemizygous regions (in case the kid is actually het but happens to have an allele balance that looks homvar). I think I come down on the side of “yes”. I usually use a depth minimum of 10 for autosomal calls, so I don’t think it would get me many new ones, but there might be others who are working with lower thresholds. And if it uncovers a new kind of error mode it would be easy to filter out of the list of calls. Do you have thoughts on this?

kyle · December 2, 2021, 2:22am

Hi all – as an update, I talked about the de novo caller code with Dave Cutler (genetics faculty at Emory and a PI in our consortium), and I’m currently trying it this way (which also incorporates some of Tim’s edits from his pull request):

import hail as hl
from hail.genetics.pedigree import Pedigree
from hail.matrixtable import MatrixTable
from hail.expr import expr_float64
from hail.table import Table
from hail.typecheck import typecheck, numeric
from hail.methods.misc import require_biallelic

@typecheck(mt=MatrixTable,
           pedigree=Pedigree,
           pop_frequency_prior=expr_float64,
           min_gq=int,
           min_p=numeric,
           max_parent_ab=numeric,
           min_child_ab=numeric,
           min_dp_ratio=numeric,
           ignore_in_sample_allele_frequency=bool)
def my_de_novo(mt: MatrixTable,
            pedigree: Pedigree,
            pop_frequency_prior,
            *,
            min_gq: int = 20,
            min_p: float = 0.05,
            max_parent_ab: float = 0.05,
            min_child_ab: float = 0.20,
            min_dp_ratio: float = 0.10,
            ignore_in_sample_allele_frequency: bool = False) -> Table:
  
    DE_NOVO_PRIOR = 1 / 30000000
    MIN_POP_PRIOR = 100 / 30000000

    required_entry_fields = {'GT', 'AD', 'DP', 'GQ', 'PL'}
    missing_fields = required_entry_fields - set(mt.entry)
    if missing_fields:
        raise ValueError(f"'de_novo': expected 'MatrixTable' to have at least {required_entry_fields}, "
                         f"missing {missing_fields}")

    pop_frequency_prior = hl.case() \
        .when((pop_frequency_prior >= 0) & (pop_frequency_prior <= 1), pop_frequency_prior) \
        .or_error(hl.str("de_novo: expect 0 <= pop_frequency_prior <= 1, found " + hl.str(pop_frequency_prior)))

    if ignore_in_sample_allele_frequency:
        # this mode is used when families larger than a single trio are observed, in which
        # an allele might be de novo in a parent and transmitted to a child in the dataset.
        # The original model does not handle this case correctly, and so this experimental
        # mode can be used to treat each trio as if it were the only one in the dataset.
        mt = mt.annotate_rows(__prior=pop_frequency_prior,
                              __alt_alleles=hl.int64(1),
                              __site_freq=hl.max(pop_frequency_prior, MIN_POP_PRIOR))
    else:
        n_alt_alleles = hl.agg.sum(mt.GT.n_alt_alleles())
        total_alleles = 2 * hl.agg.sum(hl.is_defined(mt.GT))
        # subtract 1 from __alt_alleles to correct for the observed genotype
        mt = mt.annotate_rows(__prior=pop_frequency_prior,
                              __alt_alleles=n_alt_alleles,
                              __site_freq=hl.max((n_alt_alleles - 1) / total_alleles,
                                                 pop_frequency_prior,
                                                 MIN_POP_PRIOR))

    mt = require_biallelic(mt, 'de_novo')

    tm = hl.trio_matrix(mt, pedigree, complete_trios=True)
    tm = tm.annotate_rows(__autosomal=tm.locus.in_autosome_or_par(),
                          __x_nonpar=tm.locus.in_x_nonpar(),
                          __y_nonpar=tm.locus.in_y_nonpar(),
                          __mito=tm.locus.in_mito())
    
    autosomal = tm.__autosomal | (tm.__x_nonpar & tm.is_female)
    hemi_x = tm.__x_nonpar & ~tm.is_female
    hemi_y = tm.__y_nonpar & ~tm.is_female
    hemi_mt = tm.__mito

    is_snp = hl.is_snp(tm.alleles[0], tm.alleles[1])
    n_alt_alleles = tm.__alt_alleles
    prior = tm.__site_freq
    
    # Updated for hemizygous child calls to not require a call in uninvolved parent
    has_candidate_gt_configuration = (
        ( autosomal & tm.proband_entry.GT.is_het() & 
          tm.father_entry.GT.is_hom_ref() & tm.mother_entry.GT.is_hom_ref() ) |
        ( hemi_x & tm.proband_entry.GT.is_hom_var() & tm.mother_entry.GT.is_hom_ref() ) |
        ( hemi_y & tm.proband_entry.GT.is_hom_var() & tm.father_entry.GT.is_hom_ref() ) |
        ( hemi_mt & tm.proband_entry.GT.is_hom_var() & tm.mother_entry.GT.is_hom_ref() ) )
    
    failure = hl.null(hl.tstruct(p_de_novo=hl.tfloat64, confidence=hl.tstr))
    
    kid = tm.proband_entry
    dad = tm.father_entry
    mom = tm.mother_entry

    kid_linear_pl = 10 ** (-kid.PL / 10)
    kid_pp = hl.bind(lambda x: x / hl.sum(x), kid_linear_pl)

    dad_linear_pl = 10 ** (-dad.PL / 10)
    dad_pp = hl.bind(lambda x: x / hl.sum(x), dad_linear_pl)

    mom_linear_pl = 10 ** (-mom.PL / 10)
    mom_pp = hl.bind(lambda x: x / hl.sum(x), mom_linear_pl)

    kid_ad_ratio = kid.AD[1] / hl.sum(kid.AD)
    
    # Try to get these all to an expected value of 0.5
    dp_ratio = (hl.case()
                  .when(hemi_x, kid.DP / mom.DP) # Because mom is diploid but kid is not
                  .when(hemi_y, kid.DP / (2*dad.DP))
                  .when(hemi_mt, kid.DP / (2*mom.DP))
                  .when(tm.__x_nonpar & tm.is_female, (kid.DP / (mom.DP + dad.DP)) * (3/4)) # Because of hemi dad
                  .default( kid.DP / (mom.DP + dad.DP) ))

    def solve(p_de_novo, parent_tot_read_check, parent_alt_read_check):
        return (
            hl.case()
            .when(kid.GQ < min_gq, failure)
            .when((dp_ratio < min_dp_ratio)
                  | (kid_ad_ratio < min_child_ab), failure)
            .when(~parent_tot_read_check, failure)
            .when(~parent_alt_read_check, failure)
            .when(p_de_novo < min_p, failure)
            .when(~is_snp, hl.case()
                  .when((p_de_novo > 0.99) & (kid_ad_ratio > 0.3) & (n_alt_alleles == 1),
                        hl.struct(p_de_novo=p_de_novo, confidence='HIGH'))
                  .when((p_de_novo > 0.5) & (kid_ad_ratio > 0.3) & (n_alt_alleles <= 5),
                        hl.struct(p_de_novo=p_de_novo, confidence='MEDIUM'))
                  .when(kid_ad_ratio > 0.2,
                        hl.struct(p_de_novo=p_de_novo, confidence='LOW'))
                  .or_missing())
            .default(hl.case()
                     .when(((p_de_novo > 0.99) & (kid_ad_ratio > 0.3) & (dp_ratio > 0.2))
                           | ((p_de_novo > 0.99) & (kid_ad_ratio > 0.3) & (n_alt_alleles == 1))
                           | ((p_de_novo > 0.5) & (kid_ad_ratio > 0.3) & (n_alt_alleles < 10) & (kid.DP > 10)),
                           hl.struct(p_de_novo=p_de_novo, confidence='HIGH'))
                     .when((p_de_novo > 0.5) & ((kid_ad_ratio > 0.3) | (n_alt_alleles == 1)),
                           hl.struct(p_de_novo=p_de_novo, confidence='MEDIUM'))
                     .when(kid_ad_ratio > 0.2,
                           hl.struct(p_de_novo=p_de_novo, confidence='LOW'))
                     .or_missing()))
    
    # Call autosomal or pseudoautosomal de novo variants
    def call_auto(kid_pp, dad_pp, mom_pp):
        p_data_given_dn = dad_pp[0] * mom_pp[0] * kid_pp[1] * DE_NOVO_PRIOR
        p_het_in_parent = 1 - (1 - prior) ** 4
        p_data_given_missed_het = (dad_pp[1]*mom_pp[0] + dad_pp[0]*mom_pp[1]) * kid_pp[1] * p_het_in_parent
        # Note: a full treatment would include possibility kid genotype is in error:
        # "+ ( (dad_pp[0]*mom_pp[0] + dad_pp[0]*mom_pp[0]) * kid_pp[0] * (1 - p_het_in_parent) )",
        # which is approximately (dad_pp[0]*mom_pp[0] + dad_pp[0]*mom_pp[0]) * kid_pp[0]
        
        p_de_novo = p_data_given_dn / (p_data_given_dn + p_data_given_missed_het)

        parent_tot_read_check = (hl.sum(mom.AD) > 0) & (hl.sum(dad.AD) > 0)
        parent_alt_read_check = ( ((mom.AD[1] / hl.sum(mom.AD)) <= max_parent_ab) &
                                  ((dad.AD[1] / hl.sum(dad.AD)) <= max_parent_ab) )
            
        return hl.bind(solve, p_de_novo, parent_tot_read_check, parent_alt_read_check)

    # Call hemizygous de novo variants on the X
    def call_hemi_x(kid_pp, parent, parent_pp):
        p_data_given_dn = parent_pp[0] * kid_pp[2] * DE_NOVO_PRIOR
        p_het_in_parent = 1 - (1 - prior) ** 2
        p_data_given_missed_het = ((parent_pp[1] + parent_pp[2]) * kid_pp[2] * 
                                   p_het_in_parent) + (parent_pp[0] * kid_pp[0])
        # Note: if simplified to be more in line with the autosomal calls, this would be:
        # parent_pp[1] * kid_pp[2] * p_het_in_parent
        
        p_de_novo = p_data_given_dn / (p_data_given_dn + p_data_given_missed_het)

        parent_tot_read_check = (hl.sum(parent.AD) > 0)
        parent_alt_read_check = (parent.AD[1] / hl.sum(parent.AD)) <= max_parent_ab      

        return hl.bind(solve, p_de_novo, parent_tot_read_check, parent_alt_read_check)

    # Call hemizygous de novo variants on the Y and mitochondrial chromosome (ignoring mito heteroplasmy)
    def call_hemi_y_mt(kid_pp, parent, parent_pp):
        p_data_given_dn = parent_pp[0] * kid_pp[2] * DE_NOVO_PRIOR
        p_het_in_parent = 1 - (1 - prior) ** 1
        p_data_given_missed_het = (parent_pp[2] * kid_pp[2] * 
                                   p_het_in_parent) + (parent_pp[0] * kid_pp[0])
        # Note: if simplified to be more in line with the autosomal calls, this would be:
        # parent_pp[2] * kid_pp[2] * p_het_in_parent
        
        p_de_novo = p_data_given_dn / (p_data_given_dn + p_data_given_missed_het)

        parent_tot_read_check = (hl.sum(parent.AD) > 0)
        parent_alt_read_check = (parent.AD[1] / hl.sum(parent.AD)) <= max_parent_ab      

        return hl.bind(solve, p_de_novo, parent_tot_read_check, parent_alt_read_check)       
    
    # Main routine
    de_novo_call = (
        hl.case()
        .when(~has_candidate_gt_configuration, failure)
        .when(autosomal, hl.bind(call_auto, kid_pp, dad_pp, mom_pp))
        .when(hemi_x, hl.bind(call_hemi_x, kid_pp, mom, mom_pp))
        .when(hemi_y, hl.bind(call_hemi_y_mt, kid_pp, dad, dad_pp))
        .when(hemi_mt, hl.bind(call_hemi_y_mt, kid_pp, mom, mom_pp))
        .or_missing())

    tm = tm.annotate_entries(__call=de_novo_call)
    tm = tm.filter_entries(hl.is_defined(tm.__call))
    entries = tm.entries()
    return (entries.select('__site_freq',
                           'proband',
                           'father',
                           'mother',
                           'proband_entry',
                           'father_entry',
                           'mother_entry',
                           'is_female',
                           **entries.__call)
            .rename({'__site_freq': 'prior'}))

iloverna · April 28, 2022, 4:23pm

Just checking, can I know if hemizygous X variant calling is now supported in Hail natively? I would very much love to see this feature!

niamh · June 30, 2022, 4:32pm

Just reviving this thread, I am interested to this modified function to call hemizygous X variants in males. Can I check if this my_de_novo function functions the same as hl.de_novo (i.e. same output), aside from the hemizygous X male calls? Thank you.

kyle · July 1, 2022, 2:18pm

It’s the same on the autosome. It won’t be exactly the same for females on the X because I tweaked how the depth ratio is calculated outside the PAR.

niamh_murphy · July 6, 2022, 1:09am

Checking in again (sorry misplaced my prev account), but I had some discrepant calls on the autosome as well. I used both hl.de_novo and your custom function with the same parameters, do you have any insights as to why that could be the case?

kyle · September 16, 2022, 5:23pm

I’ve continued to work on this and ought to post another follow-up… here is the version of the caller I’m currently using after more conversation with Dave Cutler and Mark Daly. It no longer is the same as hl.de_novo on the autosome, since I changed how the probability of a parental alt allele is calculated, added terms to account for the possibility that the kid doesn’t actually have a variant, and now have different de novo priors for mom vs dad. It also uses my own function to get AC and AN values (I’ll post that below). In practice this still gives pretty similar output to hl.de_novo, with the addition of calls in hemizygous regions.

# THIS REQUIRES A DATASET WITH mt.AC AND mt.AN VALUES
@typecheck(mt=MatrixTable,
           pedigree=Pedigree,
           pop_frequency_prior=expr_float64,
           min_gq=int,
           min_p=numeric,
           max_parent_ab=numeric,
           min_child_ab=numeric,
           min_dp_ratio=numeric,
           ignore_in_sample_allele_frequency=bool)

def my_de_novo_v16(mt: MatrixTable,
            pedigree: Pedigree,
            pop_frequency_prior,
            *,
            min_gq: int = 20,
            min_p: float = 0.05,
            max_parent_ab: float = 0.05,
            min_child_ab: float = 0.20,
            min_dp_ratio: float = 0.10,
            ignore_in_sample_allele_frequency: bool = False) -> Table:
  
    MOM_DE_NOVO_PRIOR = 1 / 100000000
    DAD_DE_NOVO_PRIOR = 1 / 25000000
    MITO_DE_NOVO_PRIOR = 10 * MOM_DE_NOVO_PRIOR # Rough estimate, could use review for mitochondrial applications
    
    MIN_POP_PRIOR = 100 / 30000000

    required_entry_fields = {'GT', 'AD', 'DP', 'GQ', 'PL'}
    missing_fields = required_entry_fields - set(mt.entry)
    if missing_fields:
        raise ValueError(f"'de_novo': expected 'MatrixTable' to have at least {required_entry_fields}, "
                         f"missing {missing_fields}")

    pop_frequency_prior = hl.case() \
        .when((pop_frequency_prior >= 0) & (pop_frequency_prior <= 1), pop_frequency_prior) \
        .or_error(hl.str("de_novo: expect 0 <= pop_frequency_prior <= 1, found " + hl.str(pop_frequency_prior)))

    if ignore_in_sample_allele_frequency:
        # this mode is used when families larger than a single trio are observed, in which
        # an allele might be de novo in a parent and transmitted to a child in the dataset.
        # The original model does not handle this case correctly, and so this experimental
        # mode can be used to treat each trio as if it were the only one in the dataset.
        mt = mt.annotate_rows(__prior=pop_frequency_prior,
                              __alt_alleles=hl.int64(1),
                              __site_freq=hl.max(pop_frequency_prior, MIN_POP_PRIOR))
    else:
        # Note that this now requires prior annotation with AC and AN values
        # (did this August 2022 to correct for hemizygous situations)
        n_alt_alleles = mt.AC
        total_alleles = mt.AN
        # subtract 1 from __alt_alleles to correct for the observed genotype
        mt = mt.annotate_rows(__prior=pop_frequency_prior,
                              __alt_alleles=n_alt_alleles,
                              __site_freq=hl.max((n_alt_alleles - 1) / total_alleles,
                                                 pop_frequency_prior,
                                                 MIN_POP_PRIOR))

    mt = require_biallelic(mt, 'de_novo')

    tm = hl.trio_matrix(mt, pedigree, complete_trios=True)
    tm = tm.annotate_rows(__autosomal=tm.locus.in_autosome_or_par(),
                          __x_nonpar=tm.locus.in_x_nonpar(),
                          __y_nonpar=tm.locus.in_y_nonpar(),
                          __mito=tm.locus.in_mito())
    
    autosomal = tm.__autosomal | (tm.__x_nonpar & tm.is_female)
    hemi_x = tm.__x_nonpar & ~tm.is_female
    hemi_y = tm.__y_nonpar & ~tm.is_female
    hemi_mt = tm.__mito

    is_snp = hl.is_snp(tm.alleles[0], tm.alleles[1])
    n_alt_alleles = tm.__alt_alleles
    prior = tm.__site_freq
    
    # Updated for hemizygous child calls to not require a call in uninvolved parent
    has_candidate_gt_configuration = (
        ( autosomal & tm.proband_entry.GT.is_het() & 
          tm.father_entry.GT.is_hom_ref() & tm.mother_entry.GT.is_hom_ref() ) |
        ( hemi_x & tm.proband_entry.GT.is_hom_var() & tm.mother_entry.GT.is_hom_ref() ) |
        ( hemi_y & tm.proband_entry.GT.is_hom_var() & tm.father_entry.GT.is_hom_ref() ) |
        ( hemi_mt & tm.proband_entry.GT.is_hom_var() & tm.mother_entry.GT.is_hom_ref() ) )
    
    failure = hl.missing(hl.tstruct(p_de_novo=hl.tfloat64, confidence=hl.tstr))
    
    kid = tm.proband_entry
    dad = tm.father_entry
    mom = tm.mother_entry

    kid_linear_pl = 10 ** (-kid.PL / 10)
    kid_pp = hl.bind(lambda x: x / hl.sum(x), kid_linear_pl)

    dad_linear_pl = 10 ** (-dad.PL / 10)
    dad_pp = hl.bind(lambda x: x / hl.sum(x), dad_linear_pl)

    mom_linear_pl = 10 ** (-mom.PL / 10)
    mom_pp = hl.bind(lambda x: x / hl.sum(x), mom_linear_pl)

    kid_ad_ratio = kid.AD[1] / hl.sum(kid.AD)
    
    # Try to get these all to an expected value of 0.5
    dp_ratio = (hl.case()
                  .when(hemi_x, kid.DP / mom.DP) # Because mom is diploid but kid is not
                  .when(hemi_y, kid.DP / (2*dad.DP))
                  .when(hemi_mt, kid.DP / (2*mom.DP))
                  .when(tm.__x_nonpar & tm.is_female, (kid.DP / (mom.DP + dad.DP)) * (3/4)) # Because of hemi dad
                  .default( kid.DP / (mom.DP + dad.DP) ))

    def solve(p_de_novo, parent_tot_read_check, parent_alt_read_check):
        return (
            hl.case()
            .when(kid.GQ < min_gq, failure)
            .when((dp_ratio < min_dp_ratio)
                  | (kid_ad_ratio < min_child_ab), failure)
            .when(~parent_tot_read_check, failure)
            .when(~parent_alt_read_check, failure)
            .when(p_de_novo < min_p, failure)
            .when(~is_snp, hl.case()
                  .when((p_de_novo > 0.99) & (kid_ad_ratio > 0.3) & (n_alt_alleles == 1),
                        hl.struct(p_de_novo=p_de_novo, confidence='HIGH'))
                  .when((p_de_novo > 0.5) & (kid_ad_ratio > 0.3) & (n_alt_alleles <= 5),
                        hl.struct(p_de_novo=p_de_novo, confidence='MEDIUM'))
                  .when(kid_ad_ratio > 0.2,
                        hl.struct(p_de_novo=p_de_novo, confidence='LOW'))
                  .or_missing())
            .default(hl.case()
                     .when(((p_de_novo > 0.99) & (kid_ad_ratio > 0.3) & (dp_ratio > 0.2))
                           | ((p_de_novo > 0.99) & (kid_ad_ratio > 0.3) & (n_alt_alleles == 1))
                           | ((p_de_novo > 0.5) & (kid_ad_ratio > 0.3) & (n_alt_alleles < 10) & (kid.DP > 10)),
                           hl.struct(p_de_novo=p_de_novo, confidence='HIGH'))
                     .when((p_de_novo > 0.5) & ((kid_ad_ratio > 0.3) | (n_alt_alleles == 1)),
                           hl.struct(p_de_novo=p_de_novo, confidence='MEDIUM'))
                     .when(kid_ad_ratio > 0.2,
                           hl.struct(p_de_novo=p_de_novo, confidence='LOW'))
                     .or_missing()))
    
    # Call autosomal or pseudoautosomal de novo variants
    def call_auto(kid_pp, dad_pp, mom_pp):
        p_data_given_dn_from_mom = mom_pp[0] * dad_pp[0] * kid_pp[1]
        p_dn_from_mom = ((1 - prior) ** 4) * MOM_DE_NOVO_PRIOR * (1 - DAD_DE_NOVO_PRIOR)
        
        p_data_given_dn_from_dad = mom_pp[0] * dad_pp[0] * kid_pp[1]
        p_dn_from_dad = ((1 - prior) ** 4) * (1 - MOM_DE_NOVO_PRIOR) * DAD_DE_NOVO_PRIOR
        
        p_data_given_mom_alt_allele = mom_pp[1] * dad_pp[0] * kid_pp[1]
        p_mom_alt_allele = prior * ((1 - prior) ** 3) * (1 - MOM_DE_NOVO_PRIOR) * (1 - DAD_DE_NOVO_PRIOR)

        p_data_given_dad_alt_allele = mom_pp[0] * dad_pp[1] * kid_pp[1]
        p_dad_alt_allele = prior * ((1 - prior) ** 3) * (1 - MOM_DE_NOVO_PRIOR) * (1 - DAD_DE_NOVO_PRIOR)

        p_data_given_no_alt_alleles = mom_pp[0] * dad_pp[0] * kid_pp[0]
        p_no_alt_alleles = ((1 - prior) ** 4) * (1 - MOM_DE_NOVO_PRIOR) * (1 - DAD_DE_NOVO_PRIOR)
                               
        p_de_novo = ( (p_data_given_dn_from_mom * p_dn_from_mom) + (p_data_given_dn_from_dad * p_dn_from_dad) ) / (
                      (p_data_given_dn_from_mom * p_dn_from_mom) + (p_data_given_dn_from_dad * p_dn_from_dad) +
                      (p_data_given_mom_alt_allele * p_mom_alt_allele) + (p_data_given_dad_alt_allele * p_dad_alt_allele) +
                      (p_data_given_no_alt_alleles * p_no_alt_alleles) )

        parent_tot_read_check = (hl.sum(mom.AD) > 0) & (hl.sum(dad.AD) > 0)
        parent_alt_read_check = ( ((mom.AD[1] / hl.sum(mom.AD)) <= max_parent_ab) &
                                  ((dad.AD[1] / hl.sum(dad.AD)) <= max_parent_ab) )
            
        return hl.bind(solve, p_de_novo, parent_tot_read_check, parent_alt_read_check)

    # Call hemizygous de novo variants on the X
    def call_hemi_x(kid_pp, mom_pp):
        p_data_given_dn_from_mom = mom_pp[0] * kid_pp[2]
        p_dn_from_mom = ((1 - prior) ** 2) * MOM_DE_NOVO_PRIOR

        p_data_given_mom_alt_allele = mom_pp[1] * kid_pp[2]
        p_mom_alt_allele = prior * (1 - prior) * (1 - MOM_DE_NOVO_PRIOR)

        p_data_given_no_alt_alleles = mom_pp[0] * kid_pp[0]
        p_no_alt_alleles = ((1 - prior) ** 2) * (1 - MOM_DE_NOVO_PRIOR)

        p_de_novo = ( (p_data_given_dn_from_mom * p_dn_from_mom) ) / (
                      (p_data_given_dn_from_mom * p_dn_from_mom) +
                      (p_data_given_mom_alt_allele * p_mom_alt_allele) +
                      (p_data_given_no_alt_alleles * p_no_alt_alleles) )

        parent_tot_read_check = (hl.sum(mom.AD) > 0)
        parent_alt_read_check = (mom.AD[1] / hl.sum(mom.AD)) <= max_parent_ab      

        return hl.bind(solve, p_de_novo, parent_tot_read_check, parent_alt_read_check)

    # Call hemizygous de novo variants on the Y
    def call_hemi_y(kid_pp, dad_pp):
        p_data_given_dn_from_dad = dad_pp[0] * kid_pp[2]
        p_dn_from_dad = (1 - prior) * DAD_DE_NOVO_PRIOR

        p_data_given_dad_alt_allele = dad_pp[2] * kid_pp[2]
        p_dad_alt_allele = prior * (1 - DAD_DE_NOVO_PRIOR)

        p_data_given_no_alt_alleles = dad_pp[0] * kid_pp[0]
        p_no_alt_alleles = (1 - prior) * (1 - DAD_DE_NOVO_PRIOR)
        
        p_de_novo = ( (p_data_given_dn_from_dad * p_dn_from_dad) ) / (
                      (p_data_given_dn_from_dad * p_dn_from_dad) +
                      (p_data_given_dad_alt_allele * p_dad_alt_allele) +
                      (p_data_given_no_alt_alleles * p_no_alt_alleles) )

        parent_tot_read_check = (hl.sum(dad.AD) > 0)
        parent_alt_read_check = (dad.AD[1] / hl.sum(dad.AD)) <= max_parent_ab      

        return hl.bind(solve, p_de_novo, parent_tot_read_check, parent_alt_read_check)       
    
    # Call de novo variants on the mitochondrial chromosome 
    # (treating it as a hemizygous situation and ignoring mito heteroplasmy)
    def call_hemi_mt(kid_pp, mom_pp):
        p_data_given_dn_from_mom = mom_pp[0] * kid_pp[2]
        p_dn_from_mom = (1 - prior) * MITO_DE_NOVO_PRIOR

        p_data_given_mom_alt_allele = mom_pp[2] * kid_pp[2]
        p_mom_alt_allele = prior * (1 - MITO_DE_NOVO_PRIOR)

        p_data_given_no_alt_alleles = mom_pp[0] * kid_pp[0]
        p_no_alt_alleles = (1 - prior) * (1 - MITO_DE_NOVO_PRIOR)
        
        p_de_novo = ( (p_data_given_dn_from_mom * p_dn_from_mom) ) / (
                      (p_data_given_dn_from_mom * p_dn_from_mom) +
                      (p_data_given_mom_alt_allele * p_mom_alt_allele) +
                      (p_data_given_no_alt_alleles * p_no_alt_alleles) )

        parent_tot_read_check = (hl.sum(mom.AD) > 0)
        parent_alt_read_check = (mom.AD[1] / hl.sum(mom.AD)) <= max_parent_ab      

        return hl.bind(solve, p_de_novo, parent_tot_read_check, parent_alt_read_check)       

    
    # Main routine
    de_novo_call = (
        hl.case()
        .when(~has_candidate_gt_configuration, failure)
        .when(autosomal, hl.bind(call_auto, kid_pp, dad_pp, mom_pp))
        .when(hemi_x, hl.bind(call_hemi_x, kid_pp, mom_pp))
        .when(hemi_y, hl.bind(call_hemi_y, kid_pp, dad_pp))
        .when(hemi_mt, hl.bind(call_hemi_mt, kid_pp, mom_pp))
        .or_missing())

    tm = tm.annotate_entries(__call=de_novo_call)
    tm = tm.filter_entries(hl.is_defined(tm.__call))
    entries = tm.entries()
    
    return (entries.select('__site_freq',
                           'proband',
                           'father',
                           'mother',
                           'proband_entry',
                           'father_entry',
                           'mother_entry',
                           'is_female',
                           **entries.__call)
            .rename({'__site_freq': 'prior'}))

kyle · September 16, 2022, 5:25pm

This is where I get the AC and AN values from.

# Define sex-aware variant call rate calculation with Hardy-Weinberg p value
def sex_aware_variant_annotations_with_pHWE(mt):
    num_males = mt.aggregate_cols(hl.agg.count_where(mt.is_female == False))
    num_females = mt.aggregate_cols(hl.agg.count_where(mt.is_female == True))
    
    mt = mt.annotate_rows(
        male_hets = hl.agg.count_where(mt.GT.is_het() & (mt.is_female == False)),
        male_homvars = hl.agg.count_where(mt.GT.is_hom_var() & (mt.is_female == False)),
        male_calls = hl.agg.count_where(hl.is_defined(mt.GT) & (mt.is_female == False)),
        female_hets = hl.agg.count_where(mt.GT.is_het() & (mt.is_female == True)),
        female_homvars = hl.agg.count_where(mt.GT.is_hom_var() & (mt.is_female == True)),
        female_calls = hl.agg.count_where(hl.is_defined(mt.GT) & (mt.is_female == True))
    )
    
    mt = mt.annotate_rows(
        call_rate = ( hl.case()
            .when(mt.locus.in_y_nonpar(), (mt.male_calls / num_males))
            .when(mt.locus.in_x_nonpar(), (mt.male_calls + 2*mt.female_calls) / (num_males + 2*num_females))
            .default((mt.male_calls + mt.female_calls) / (num_males + num_females)) ),
        AC = ( hl.case()
            .when(mt.locus.in_y_nonpar(), mt.male_homvars)
            .when(mt.locus.in_x_nonpar(), mt.male_homvars + mt.female_hets + 2*mt.female_homvars)
            .default(mt.male_hets + 2*mt.male_homvars + mt.female_hets + 2*mt.female_homvars) ),
        AN = ( hl.case()
            .when(mt.locus.in_y_nonpar(), mt.male_calls)
            .when(mt.locus.in_x_nonpar(), mt.male_calls + 2*mt.female_calls)
            .default(2*mt.male_calls + 2*mt.female_calls) ),
        pHWE = ( hl.case() 
            .when(mt.locus.in_y_nonpar() | mt.locus.in_mito(), 1.0)
            .when(mt.locus.in_x_nonpar(), hl.hardy_weinberg_test(
                hl.int32(mt.female_calls - mt.female_hets - mt.female_homvars), 
                hl.int32(mt.female_hets), 
                hl.int32(mt.female_homvars)).p_value)
            .default(hl.hardy_weinberg_test(
                hl.int32(mt.male_calls+mt.female_calls - mt.male_hets-mt.female_hets - mt.male_homvars-mt.female_homvars), 
                hl.int32(mt.male_hets + mt.female_hets), 
                hl.int32(mt.male_homvars + mt.female_homvars)).p_value) )
    )
    
    return (mt)

Topic		Replies	Views
De Novo on Y chromosome? Hail Query & hailctl	2	484	April 27, 2022
Male heterozygotes on X chromosome Hail Query & hailctl	0	185	January 11, 2024
UKBiobank chromosome XY Hail Query & hailctl	32	1462	December 10, 2018
Mendel_errors and split_multi_hts Hail Query & hailctl	6	434	October 7, 2020
Diploid to haploid chrX calls Hail Query & hailctl	1	395	August 5, 2020

De novo calls on hemizygous X variants

Related topics