I’m using hail for a quality control pipeline. I transform a VCF into a hail matrix, run the quality control and then export it as a VCF again.
I’ve noticed that after filtering the gentotypes the reference genotypes (0/0) are exported as missing in the VCF (./.), even if they pass the quality scores.
Am I doing anything incorrectly? Is there a way of avoing this behaviour?
These are the commands that causes it:
if "GQ" in mt.entry:
mt = mt.filter_entries(mt.GQ >= 20)
else:
print("GQ information not available")
if "AD" in mt.entry:
mt = mt.filter_entries(hl.or_else(mt.AD[1] / hl.sum(mt.AD), 0) >= 0.2)
else:
print("AD information not available")
For the data that you have, are GQ and AD defined for all reference genotypes?
If the result of the filter expression is missing, then the entry will be filtered. To avoid this, you’ll need to check to explicitly keep the entry if the field is missing, for example: