Annotating parameters required for VQSR or hard-filtering

Hello, Hail team.

I generated a single sparse MatrixTable by merging 5k GVCFs with run_combiner() in Hail. The next step, VQSR or hard-filtering to keep only high-quality variant sites, requires parameters like QD, MQ, SOR, and FS. I would like to know how to get this information at the variant level. In the gnomAD v3 blog, I noticed that the gnomAD team used those parameters in this filtering step.

Any comments about the filtering process for an MT produced by GVCF joint calling are welcome.

-Jina

The gnomAD methods repository has some functionality to do this.

Here’s a script I used several months ago to generate VQSR inputs.

import hail as hl
# The wildcard import supplies get_site_info_expr, get_as_info_expr,
# get_adj_expr, and get_lowqual_expr used below.
from gnomad.utils.sparse_mt import *
from gnomad.utils.vcf import ht_to_vcf_mt

def compute_info(mt: hl.MatrixTable) -> hl.Table:
    """Compute a site-level info Table (allele-specific and site-level
    metrics) from a sparse MatrixTable produced by run_combiner()."""
    # Drop sites with no alt alleles, then hoist the gvcf_info fields into entries
    mt = mt.filter_rows(hl.len(mt.alleles) > 1)
    mt = mt.transmute_entries(**mt.gvcf_info)

    # Compute AS and site level info expr
    # Note that production defaults have changed:
    # For new releases, the `RAW_MQandDP` field replaces the `RAW_MQ` and `MQ_DP` fields
    info_expr = get_site_info_expr(
        mt,
        sum_agg_fields=[],
        int32_sum_agg_fields=[],
        array_sum_agg_fields=['SB']
    )
    info_expr = info_expr.annotate(
        **get_as_info_expr(
            mt,
            sum_agg_fields=[],
            int32_sum_agg_fields=[],
            array_sum_agg_fields=['SB']
        )
    )

    # Add AC and AC_raw:
    # First compute ACs for each non-ref allele, grouped by adj
    grp_ac_expr = hl.agg.array_agg(
        lambda ai: hl.agg.filter(
            mt.LA.contains(ai),
            hl.agg.group_by(
                get_adj_expr(mt.LGT, mt.GQ, mt.DP, mt.LAD),
                hl.agg.sum(
                    mt.LGT.one_hot_alleles(mt.LA.map(lambda x: hl.str(x)))[mt.LA.index(ai)]
                )
            )
        ),
        hl.range(1, hl.len(mt.alleles))
    )

    # Then, for each non-ref allele, compute
    # AC as the adj group
    # AC_raw as the sum of adj and non-adj groups
    info_expr = info_expr.annotate(
        AC_raw=grp_ac_expr.map(lambda i: hl.int32(i.get(True, 0) + i.get(False, 0))),
        AC=grp_ac_expr.map(lambda i: hl.int32(i.get(True, 0)))
    )

    info_ht = mt.select_rows(
        info=info_expr
    ).rows()

    # Add lowqual flag
    info_ht = info_ht.annotate(
        lowqual=get_lowqual_expr(
            info_ht.alleles,
            info_ht.info.QUALapprox,
            # The indel het prior used for gnomAD v3 was 1/10k bases (phred=40).
            # This value is usually 1/8k bases (phred=39).
            indel_phred_het_prior=40
        )
    )

    return info_ht


mt = hl.read_matrix_table('combiner_out_chr22.mt/')
mt = mt.key_rows_by('locus', 'alleles')
ht = compute_info(mt)
hl.export_vcf(ht_to_vcf_mt(ht), 'vqsr.vcf')
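The AC / AC_raw computation above groups per-allele counts by the adj (high-quality genotype) flag and then sums the groups. Stripped of the Hail aggregators, the grouping logic is just this (a plain-Python sketch; the counts are made up for illustration):

```python
# Per-sample (alt_allele_count, passes_adj_filters) pairs for one alt allele;
# these values are hypothetical.
calls = [(1, True), (2, True), (1, False), (0, True)]

# Group allele counts by the adj flag, mirroring hl.agg.group_by above
groups = {True: 0, False: 0}
for count, adj in calls:
    groups[adj] += count

AC = groups[True]                       # adj genotypes only
AC_raw = groups[True] + groups[False]   # all genotypes
```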

Thank you, Tim @tpoterba. I am sure it will be useful for us. I will look into the script you suggested and apply it to my data.

-Jina

Hi Tim @tpoterba,

Thank you. It works well except for the ‘lowqual’ flag. We can get important parameters like FS, ReadPosRankSum, and MQRankSum, but we still need others such as MQ, SOR, and QD for filtering.

  1. In the comments in your script, I can see some unfamiliar parameters: RAW_MQandDP, RAW_MQ, and MQ_DP. It seems that these parameters relate to MQ. If you do not mind, could you let me know how to get these parameters for the MQ annotation?

  2. I have no idea how QD and SOR are computed. Please give me some advice.

  3. info_ht.info.QUALapprox was not generated in the previous process.
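(For context on point 1: RAW_MQ is the sum of squared per-read mapping qualities and MQ_DP is the number of reads contributing to that sum, so MQ is recovered as sqrt(RAW_MQ / MQ_DP); the newer RAW_MQandDP field simply packs both values into one array. A minimal sketch of that relationship, assuming those two fields are available:)

```python
from math import sqrt

def mq_from_raw(raw_mq: float, mq_dp: int) -> float:
    """RMS mapping quality: RAW_MQ is the sum of squared per-read
    mapping qualities, MQ_DP the number of reads contributing."""
    return sqrt(raw_mq / mq_dp) if mq_dp > 0 else 0.0
```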

I have attached MT.describe() for the output MT of run_combiner(), along with info_ht.describe(), since they might help you understand my situation.

I am looking forward to your reply. Thank you.

-Jina

I’m sorry, I didn’t see this follow-up!

I think the parameters that aren’t in your dataset are generated by the version of GATK currently being run in production at the Broad. Unfortunately I’m not very experienced with running VQSR and what it needs. Do you have definitions for MQ/SOR/QD? I’m sure we can calculate these from the fields we have in the dataset. QD, for instance, could be the sum of GQ divided by the sum of DP per variant?
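For what it’s worth, GATK’s published definitions for two of these can be written down directly. Here is a plain-Python sketch (QD per QualByDepth: site quality over the total depth of the samples carrying an alt allele, with QUALapprox standing in for QUAL; SOR per the StrandOddsRatio formula over the 2x2 strand-bias table, with pseudocounts to avoid division by zero). A Hail version would express the same arithmetic with hl.agg aggregators over the fields already in the info table:

```python
from math import log

def qual_by_depth(qualapprox: float, sample_depths, carries_alt) -> float:
    """QD: site quality divided by the total depth of samples that
    carry a non-ref allele (GATK QualByDepth)."""
    informative_dp = sum(
        dp for dp, has_alt in zip(sample_depths, carries_alt) if has_alt
    )
    return qualapprox / informative_dp if informative_dp > 0 else 0.0

def strand_odds_ratio(ref_fwd, ref_rev, alt_fwd, alt_rev, pseudocount=1.0):
    """SOR: symmetric odds ratio of the strand-bias contingency table
    (GATK StrandOddsRatio), computable from the SB counts."""
    t00, t01 = ref_fwd + pseudocount, ref_rev + pseudocount
    t10, t11 = alt_fwd + pseudocount, alt_rev + pseudocount
    ratio = (t00 / t01) * (t11 / t10) + (t01 / t00) * (t10 / t11)
    ref_ratio = min(t00, t01) / max(t00, t01)
    alt_ratio = min(t10, t11) / max(t10, t11)
    return log(ratio) + log(ref_ratio) - log(alt_ratio)
```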

Hi Tim @tpoterba,

As you said, “we can calculate these from the fields we have in the dataset”, I also hope to get these parameters without re-running HaplotypeCaller. Actually, we are thinking about hard-filtering instead of VQSR, but it looks like we still need the same parameters.

Please let me know if you have any related information, like the QD calculation method you mentioned. Thank you.

-Jina

Yeah, I think we can compute these with aggregations over the existing data in Hail. I just need to know what we should compute. What are the necessary fields for VQSR?
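For reference, if hard-filtering ends up being the route, the SNP thresholds commonly cited in GATK’s hard-filtering documentation can be expressed as a simple predicate once those site-level fields exist (a sketch; the `info` dict is hypothetical and the field names follow the usual VCF INFO conventions):

```python
def fails_snp_hard_filters(info: dict) -> bool:
    """True if a SNP site fails the commonly cited GATK hard-filter
    thresholds: QD < 2, FS > 60, MQ < 40, SOR > 3,
    MQRankSum < -12.5, ReadPosRankSum < -8.0."""
    return (
        info["QD"] < 2.0
        or info["FS"] > 60.0
        or info["MQ"] < 40.0
        or info["SOR"] > 3.0
        or info["MQRankSum"] < -12.5
        or info["ReadPosRankSum"] < -8.0
    )
```

In Hail, the same predicate would become a `filter_rows` call over the corresponding fields of the info Table.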
