Extract transmitted/non-transmitted variants

Dear all,

I have a MatrixTable with trio WES information, and I would like to extract transmitted and non-transmitted variants. I am quite new to this, so may I get some advice as to what is the most efficient way to extract them, and in what format? Thank you very much.

Hey @MsUTR !

Have you looked at hl.trio_matrix? It converts a Matrix Table of samples into a Matrix Table of trios which is a bit easier to work with.

Once you have that representation, you can filter to entries where one parent has an alternate allele and the child does not:

mt = mt.annotate_entries(is_untransmitted=
    hl.all(
        hl.any(
            hl.all(mt.father_entry.GT.is_hom_ref(), 
                   ~mt.mother_entry.GT.is_hom_ref()),
            hl.all(~mt.father_entry.GT.is_hom_ref(),
                   mt.mother_entry.GT.is_hom_ref())
        ),
        mt.proband_entry.GT.is_hom_ref()
    )
)

Does that get you cooking? Do you have any other specific questions?

@danking Thank you very much for your response! Can I check if ~ represents something like ā€˜is notā€™?

Also, since I am going to have >1k trios, the resulting entries will be humongous. Do you have any advice on how to export those? I hope to be able to do some burden-based tests.

1 Like

Yeah ~ is the Hail way to say ā€œnotā€. Unfortunately, Python doesnā€™t let us overload the meaning of the keyword not.

Hmm. Thatā€™s a great question. I think in the 1k-10k range youā€™ll be alright with compressed CSV/TSV/VCF (use a file path with ā€œvcf.bgzā€ and Hail will auto-compress the data on export). Above When you get to 100k youā€™ll start to have trouble with a dense matrix of genotypes in VCF format (VCFs scale super-linearly), but if you drop everything other than the GTs and these transmission annotations, I think it might be OK.

Iā€™d also recommend using .export(..., parallel="header_per_shard") to get one file per partition of the Hail matrix table if you can work with that output.

What kind of burden test are you performing? Hail supports SKAT, linear, logistic, and poisson regression directly on whatever annotation youā€™d like to define (meaning you can sum, mean, stddev, etc. the genotypes in the gene before the test).

Thank you for your response. I am planning to perform a gene-level burden test based on an expectation model. Looking at the Hail statistical methods, a lot of them assess the enrichment of rows (which I believe are variants). How can I aggregate the variants per gene for these methods?

1 Like

Indeed, rows are variants if you imported a VCF.

Assuming you have a gene annotation (you can either get your own gene interval list file or use gencode from the annotation database):

mt = mt.group_rows_by(
    mt.gene_name
).aggregate(
    n_alternates = hl.agg.sum(mt.GT.n_alt_alleles())
)

You can do all sorts of funky aggregations though, like:

).aggregate(
    has_two_or_more_hom_alt = hl.agg.count_where(mt.GT.is_hom_alt()) >= 2
)
).aggregate(
    has_exactly_two_hets = hl.agg.count_where(mt.GT.is_het()) == 2
)
).aggregate(
    damaging_allele_count = hl.agg.filter(
        hl.literal({"stop_gained", ...}).contains(mt.vep_consequence),
        hl.agg.sum(mt.GT.n_alt_alleles())
    )
)

You might find the visuals in the cheatsheets helpful.

Ah, thatā€™s a cool workaround! With that, I can probably use the functions here Hail | Statistics. Also, looking at the transmission_disequilibrium_test method, is it solely intended for variant-level analysis, or is there a way to collapse it by gene-level as you had shown above?

1 Like

Iā€™m not a geneticist, but I donā€™t think the TDT method, as written, makes sense for gene aggregated data. You can take a look at the source code for TDT.

Hail has support for a bunch of statistical distributions. The TDT method uses these internally:

tab = tab.annotate(chi_sq=((tab.t - tab.u) ** 2) / (tab.t + tab.u))
tab = tab.annotate(p_value=hl.pchisqtail(tab.chi_sq, 1.0))

Just as a followup to this, when I filter for entries with the conditions above, can I confirm that it does not account for the scenario where it is homozygous reference in the proband, but heterozygous in both parents?

1 Like

I believe youā€™re referring to this:

mt = mt.annotate_entries(is_untransmitted=
    hl.all(
        hl.any(
            hl.all(mt.father_entry.GT.is_hom_ref(), 
                   ~mt.mother_entry.GT.is_hom_ref()),
            hl.all(~mt.father_entry.GT.is_hom_ref(),
                   mt.mother_entry.GT.is_hom_ref())
        ),
        mt.proband_entry.GT.is_hom_ref()
    )
)

If the proband is homozygous reference and the parents are both heterozygous, then that entry will have is_untransmitted=False.

FWIW, if you want to consider this case, Iā€™d probably do it in pieces. I find this a bit easier to reason about when Iā€™m coming into new code or coming back to code I wrote a while ago.

mt = mt.annotate_entries(
    kid_hom_one_parent_hom=...
    kid_hom_both_parents_het=hl.all(
        mt.father_entry.GT.is_het(),
        mt.mother_entry.GT.is_het(),
        hl.any(
            mt.proband_entry.GT.is_hom_ref(),
            mt.proband_entry.GT.is_hom_alt()))
...
)
mt = mt.annotate_entries(
    is_untransmitted = hl.any(
        mt.kid_hom_one_parent_hom,
        mt.kid_hom_both_parents_het,
        ...
)

For future readers with slightly different expressions, hereā€™s how I would confirm this for myself by evaluating it step-wise. The parents are both heterozygous so is_hom_ref() will return false:

    hl.all(
        hl.any(
            hl.all(False, 
                   ~False),
            hl.all(~False,
                   False)
        ),
        mt.proband_entry.GT.is_hom_ref()
    )

hl.all is like hl.and: if even one value is False, it is False:

    hl.all(
        hl.any(False, False),
        mt.proband_entry.GT.is_hom_ref()
    )

Likewise, any is like or it needs at least one True to be True:

    hl.all(
        False,
        mt.proband_entry.GT.is_hom_ref()
    )

And, again, this resolves to False.