Modifying Specific Variant GT

Hello,

I am in the process of trying to make some changes to our VCF genotypes that were recommended to our group. For those genotypes where the Reason for No Call (RNC) is equal to 1 (where ‘1’ represents a monoallelic site’), we hope to convert those GT sites to REF/REF or ‘0/0’.

Towards this goal, I was thinking of adding a new definition to our code’s existing dictionary. The last (and newest) definition would flag genotypes where the RNC is equal to ‘11’:

# dict of lambdas for filter expressions
    genotype_exprs = {
        # DP: Genotypes below reading depth
        'GDP' : lambda ds: ds.DP < args.depth_threshold,
        # GDP: heterozygous genotypes outside of allelic balance
        'GAB' : lambda ds: (
            ds.GT.is_het() & (
                (ds.AD[0]/ds.DP > max(1 - args.allelic_balance, args.allelic_balance))
                | (ds.AD[0]/ds.DP < min(1 - args.allelic_balance, args.allelic_balance))
            )
        ),
        # GGQ: Genotypes below quality threshold
        'GGQ': lambda ds: ds.GQ < args.genotype_quality,
        # GNP: Male het chrX nonpar variants
        'GNP': lambda ds: ds.GT.is_het() & ds.locus.in_x_nonpar() & (ds.sa.GENDER == "Male"),
        # GRNC: Genotypes with RNC == 1
        'GRNC': lambda ds: ds.RNC[0] == "11"
    }

From here, we would 1) change the GT to ‘0/0’ for RNC = 11 variants and 2) filter out the flagged variants as shown below

    # Apply filters again to actually remove genotypes
    vcf = vcf.annotate_entries(
    GT=hl.if_else(genotype_exprs['GRNC'](vcf), hl.Call([0, 0]), vcf.GT)
    ).filter_entries(
    genotype_exprs['GDP'](vcf) | genotype_exprs['GAB'](vcf) | genotype_exprs['GGQ'](vcf) | genotype_exprs['GNP'](vcf), keep=False)

However, this generates the following error in its current state. Can anyone suggest an alternative method of performing this change?

Traceback (most recent call last):
  File "/data/Segre_Lab/users/mgarcia/variant_qc_tool_prep_UKB_WES/hail_variant_qc_v4.py", line 550, in <module>
    GT=hl.if_else(genotype_exprs['GRNC'](vcf), '0/0', vcf.GT)
  File "<decorator-gen-738>", line 2, in if_else
  File "/gpfs/fs1/home/mg1646/.conda/envs/variantqc_hailenv/lib/python3.8/site-packages/hail/typecheck/check.py", line 584, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/gpfs/fs1/home/mg1646/.conda/envs/variantqc_hailenv/lib/python3.8/site-packages/hail/expr/functions.py", line 410, in if_else
    raise TypeError(f"'if_else' and 'cond' require the 'consequent' and 'alternate' arguments to have the same type\n"
TypeError: 'if_else' and 'cond' require the 'consequent' and 'alternate' arguments to have the same type
    consequent: type 'str'
    alternate:  type 'call'

The section with the error has a different code than the one you are showing above it. In the section with the error you have the consequent section of the if_else as the string 0/0. It should be hl.Call([0,0]).

Hello @anh151 ,

Thank you for your response, and I apologize for the error in what I shared. The error I showed above is from a version where the consequent section of our if_else statement indeed had ‘0/0’.

More recently, we tried the suggestion you provided:

    vcf = vcf.annotate_entries(
    GT=hl.if_else(genotype_exprs['GRNC'](vcf), hl.Call([0, 0]), vcf.GT)
    ).filter_entries(
    genotype_exprs['GDP'](vcf) | genotype_exprs['GAB'](vcf) | genotype_exprs['GGQ'](vcf) | genotype_exprs['GNP'](vcf), keep=False)

However, we still encountered the following error:

Traceback (most recent call last):
  File "/data/Segre_Lab/users/mgarcia/variant_qc_tool_prep_UKB_WES/hail_variant_qc_v4.py", line 549, in <module>
    vcf = vcf.annotate_entries(
  File "<decorator-gen-1226>", line 2, in filter_entries
  File "/gpfs/fs1/home/mg1646/.conda/envs/variantqc_hailenv/lib/python3.8/site-packages/hail/typecheck/check.py", line 584, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/gpfs/fs1/home/mg1646/.conda/envs/variantqc_hailenv/lib/python3.8/site-packages/hail/matrixtable.py", line 1917, in filter_entries
    analyze('MatrixTable.filter_entries', expr, self._entry_indices)
  File "<decorator-gen-708>", line 2, in analyze
  File "/gpfs/fs1/home/mg1646/.conda/envs/variantqc_hailenv/lib/python3.8/site-packages/hail/typecheck/check.py", line 584, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/gpfs/fs1/home/mg1646/.conda/envs/variantqc_hailenv/lib/python3.8/site-packages/hail/expr/expressions/expression_utils.py", line 130, in analyze
    raise errors[0]
hail.expr.expressions.base_expression.ExpressionException: 'MatrixTable.filter_entries': source mismatch
  Expected an expression from source <hail.matrixtable.MatrixTable object at 0x2ad5a3f59820>
  Found expression derived from source <hail.matrixtable.MatrixTable object at 0x2ad5a82dea90>
  Problematic field(s): ['DP', 'GT', 'AD', 'GQ', 'locus', 'sa', 'RNC']

  This error is commonly caused by chaining methods together:
    >>> ht.distinct().select(ht.x)

  Correct usage:
    >>> ht = ht.distinct()
    >>> ht = ht.select(ht.x)

Do you have any understanding of why this is?

You need to split up the annotate_entries and filter_entries methods into 2 lines. Your filter_entries method is referring to the vcf variable which has not yet been annotated.

vcf = vcf.annotate_entries(
    GT=hl.if_else(genotype_exprs['GRNC'](vcf), hl.Call([0, 0]), vcf.GT)
    )
vcf = vcf.filter_entries(
    genotype_exprs['GDP'](vcf) | genotype_exprs['GAB'](vcf) | genotype_exprs['GGQ'](vcf) | genotype_exprs['GNP'](vcf), keep=False)

Hello @anh151 ,

This took care of my issue. Thank you for your help!