Gentoype filtering - missing hom ref data

Hello!

I’m using hail for a quality control pipeline. I transform a VCF into a hail matrix, run the quality control and then export it as a VCF again.

I’ve noticed that after filtering the gentotypes the reference genotypes (0/0) are exported as missing in the VCF (./.), even if they pass the quality scores.

Am I doing anything incorrectly? Is there a way of avoing this behaviour?

These are the commands that causes it:

    if "GQ" in mt.entry:
        mt = mt.filter_entries(mt.GQ >= 20)
    else:
        print("GQ information not available")

    if "AD" in mt.entry:
        mt = mt.filter_entries(hl.or_else(mt.AD[1] / hl.sum(mt.AD), 0) >= 0.2)
    else:
        print("AD information not available")

Thanks!!

For the data that you have, are GQ and AD defined for all reference genotypes?

If the result of the filter expression is missing, then the entry will be filtered. To avoid this, you’ll need to check to explicitly keep the entry if the field is missing, for example:

mt = mt.filter_entries((mt.GT.is_hom_ref() & hl.is_missing(mt.GQ)) | (mt.GQ >= 20))
mt = mt.filter_entries((hl.or_else(mt.AD[1] / hl.sum(mt.AD), 0) >= 0.2) 
                       | (mt.GT.is_hom_ref() & hl.is_missing(mt.AD)))

Thanks for your quick response!!

To avoid filtering the NA entries I ended up with this code:

    # Extract allele balance only for heterozygous genotypes
    mt = mt.annotate_entries(
        AB=hl.or_missing(
            mt.GT.is_het(),
            mt.AD[1] / hl.sum(mt.AD)
        )
    )

    # Apply the filter only to heterozygous genotypes
    mt = mt.filter_entries(
        hl.if_else(
            mt.GT.is_het(),
            mt.AB >= 0.2,
            True  # Keep everything else (hom ref & hom alt unchanged)
        )
    )