Identifying biallelic variants

Hi there,

I don’t think Hail has a function that identifies biallelic variants (in trans) from trio data. If not, could I suggest this as a feature request?

I have a trio dataset generated using the hl.trio_matrix function:

tmp = trio_abr.select_rows(trio_abr.genotypes)
tmp.describe()

----------------------------------------
Global fields:
    'gencodeVersion': str
    'sourceFilePath': str
    'genomeVersion': str
    'sampleType': str
    'hail_version': str
----------------------------------------
Column fields:
    'id': str
    'proband': struct {
        s: str
    }
    'father': struct {
        s: str
    }
    'mother': struct {
        s: str
    }
    'is_female': bool
    'fam_id': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'genotypes': array<struct {
        num_alt: int32, 
        gq: int32, 
        ab: float64, 
        dp: float64, 
        sample_id: str
    }>
----------------------------------------
Entry fields:
    'proband_entry': struct {
        AD: array<int32>, 
        DP: int32, 
        GQ: int32, 
        GT: call, 
        MIN_DP: int32, 
        PGT: call, 
        PID: str, 
        PL: array<int32>, 
        PS: int32, 
        RGQ: int32, 
        SB: array<int32>
    }
    'father_entry': struct {
        AD: array<int32>, 
        DP: int32, 
        GQ: int32, 
        GT: call, 
        MIN_DP: int32, 
        PGT: call, 
        PID: str, 
        PL: array<int32>, 
        PS: int32, 
        RGQ: int32, 
        SB: array<int32>
    }
    'mother_entry': struct {
        AD: array<int32>, 
        DP: int32, 
        GQ: int32, 
        GT: call, 
        MIN_DP: int32, 
        PGT: call, 
        PID: str, 
        PL: array<int32>, 
        PS: int32, 
        RGQ: int32, 
        SB: array<int32>
    }
----------------------------------------
Column key: ['id']
Row key: ['locus', 'alleles']
----------------------------------------

I want to filter specifically for biallelic variants. To keep things simple for now, I’m just trying to pull out homozygous calls (hom-var in the proband and het-ref in both parents).

filter_condition = ((tmp.proband_entry.GT.is_hom_var()) &
    (tmp.mother_entry.GT.is_het_ref()) &
    (tmp.father_entry.GT.is_het_ref()))

tmp_filtered = tmp.filter_entries(filter_condition)
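
To spell out what I think filter_entries is doing with this condition, here is a plain-Python sketch of my understanding (it is not Hail code). The helper names (keep_trio, filter_entries_model) and the toy genotypes are made up for illustration, and I am assuming that an entry ends up as NA both when the condition is false and when a call is missing.

```python
from typing import List, Optional, Tuple

# A diploid call as a pair of allele indices (0 = ref); None models a missing call.
Call = Optional[Tuple[int, int]]
Trio = Tuple[Call, Call, Call]  # (proband, father, mother)


def is_hom_var(gt: Call) -> bool:
    # Two identical non-reference alleles, e.g. 1/1.
    return gt is not None and gt[0] == gt[1] and gt[0] > 0


def is_het_ref(gt: Call) -> bool:
    # Two different alleles, one of which is the reference, e.g. 0/1.
    return gt is not None and gt[0] != gt[1] and 0 in gt


def keep_trio(proband: Call, father: Call, mother: Call) -> bool:
    # Hom-var proband with both parents het-ref (assumed simple recessive pattern).
    return is_hom_var(proband) and is_het_ref(father) and is_het_ref(mother)


def filter_entries_model(row: List[Trio]) -> List[Optional[Trio]]:
    # Hypothetical model: entries that fail the condition become None ("NA").
    return [trio if keep_trio(*trio) else None for trio in row]


# One variant (row) across four trios (columns).
row = [
    ((1, 1), (0, 1), (0, 1)),  # matches the pattern
    ((0, 1), (0, 1), (0, 0)),  # proband is het
    ((1, 1), (1, 1), (0, 1)),  # father is hom-var
    (None, (0, 1), (0, 1)),    # proband call missing
]
filtered = filter_entries_model(row)
print(filtered)
```

If this model is right, an NA after filtering only says "this trio did not meet the condition at this site", and on its own cannot distinguish a real no-call from a call that was present but filtered.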

The above gives me NA for every entry that doesn’t meet the condition (I think), so I then try to remove the rows where every proband entry is NA:

tmp_filtered = tmp_filtered.filter_rows(
    hl.agg.all(hl.is_missing(tmp_filtered.proband_entry.GT)),
    keep=False
)
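
As a check on my own logic for this row filter, here is a small plain-Python sketch (again not Hail code, with made-up function names and toy data). It compares "drop rows where all entries are missing" with "keep rows where any entry is defined". I believe the second form corresponds to hl.agg.any(hl.is_defined(...)) without keep=False, but that equivalence in Hail is my assumption.

```python
from typing import List, Optional

Row = List[Optional[str]]  # one proband GT per trio; None models NA


def drop_if_all_missing(rows: List[Row]) -> List[Row]:
    # Models filter_rows(agg.all(is_missing(x)), keep=False).
    return [r for r in rows if not all(e is None for e in r)]


def keep_if_any_defined(rows: List[Row]) -> List[Row]:
    # Models filter_rows(agg.any(is_defined(x))).
    return [r for r in rows if any(e is not None for e in r)]


rows = [
    [None, "1/1", None],   # one trio matched: row is kept
    [None, None, None],    # no trio matched: row is dropped
    ["1/1", "1/1", None],  # two trios matched: row is kept
]

kept_a = drop_if_all_missing(rows)
kept_b = keep_if_any_defined(rows)
print(kept_a == kept_b, len(kept_a))
```

Both versions keep the same rows in this toy example, so the surviving rows should be exactly the variants where at least one trio matched the condition.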


# This shows the probands only, not the parents
tmp_filtered.select_entries(tmp_filtered.proband_entry.GT).show()

This looks good to me, but can I check whether the ‘NA’ shown for the other probands means ‘no call’?

As a sanity check, I do this:

tmp_filtered = tmp_filtered.explode_rows(tmp_filtered.genotypes)
tmp_filtered.select_rows(tmp_filtered.genotypes).rows().show(20)

Could an expert check that this looks right, please? Am I right in understanding that, once I apply filter_condition, NA marks the entries where the condition isn’t true?