Is there a way to get a list of non-reference heterozygous and homozygous variants by gene for each sample from a MatrixTable?

nistha · January 8, 2024, 12:59pm

I’m pretty new to working with Hail, so I’m a little confused about how to work with the MatrixTables. I would appreciate any advice!

I have a MatrixTable (mt) that looks like this:

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'ancestry_pred': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'filters': set<str>
    'a_index': int32
    'was_split': bool
    'variant_qc': struct {
        gq_stats: struct {
            mean: float64, 
            stdev: float64, 
            min: float64, 
            max: float64
        }, 
        call_rate: float64, 
        n_called: int64, 
        n_not_called: int64, 
        n_filtered: int64, 
        n_het: int64, 
        n_non_ref: int64, 
        het_freq_hwe: float64, 
        p_value_hwe: float64, 
        p_value_excess_het: float64
    }
    'info': struct {
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32, 
        homozygote_count: array<int32>
    }
    'gvs_afr_af': float64
    'gvs_amr_af': float64
    'gvs_eas_af': float64
    'gvs_eur_af': float64
    'gvs_mid_af': float64
    'gvs_oth_af': float64
    'gvs_sas_af': float64
    'gene_id': str
    'gene_symbol': str
    'consequence': str
    'omim_phenotypes_name': str
    'clinvar_classification': str
    'clinvar_phenotype': str
    'vid': str
----------------------------------------
Entry fields:
    'GT': call
    'GQ': int32
    'RGQ': int32
    'FT': str
    'AD': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

I want to get a table that looks like this, where we can get a list of non-reference heterozygous and homozygous variants (vid) by gene (gene_symbol) for each sample (s):

s	gene_symbol	vid	consequence
1	TESK2	[‘1-3414321-G-A’, ‘1-3414321-T-A’]	[‘missense_variant’, ‘missense_variant’]
1	PEX26	[‘22-18561317-T-A’, ‘22-18561317-TG-A’]	[‘missense_variant’, ‘frameshift_variant’]
2	TESK2	[‘1-3414321-G-A’]	[‘missense_variant’]
3	PEX26	[‘22-18561317-C-A’]	[‘missense_variant’]
…	…	…	…

where vid is the locus.contig - locus.position - ref_allele - alt_allele.

I tried to get this table by doing the following, but when I show the table, it is empty:

# filter for het_non_ref and hom_var
mt = mt.filter_entries(mt.GT.is_het_non_ref() | mt.GT.is_hom_var())

# get entries
pv_table = mt.entries()

# select columns
candidate_genes = pv_table.select(pv_table.s,
                                  pv_table.ancestry_pred,
                                  pv_table.locus,
                                  pv_table.alleles,
                                  pv_table.vid,
                                  pv_table.gene_symbol,
                                  pv_table.gene_id,
                                  pv_table.consequence,
                                 )

# group by gene symbol and sample id 
candidate_genes = candidate_genes.group_by("s", "gene_symbol").aggregate(
    vid = hl.agg.collect(
        candidate_genes.vid
    ),
    consequence = hl.agg.collect(
        candidate_genes.consequence
    )
)

candidate_genes.show()

I thought about using make_table but I have over 1000 samples.
What is the best strategy to get a table like the one above?

Thanks!

Topic		Replies	Views
Extracing sample IDs into a Python List Hail Query & hailctl	0	31	August 8, 2024
Annotate variants with hom var samples Hail Query & hailctl	0	342	January 19, 2023
Filtering MatrixTable for genotype in specific sample Hail Query & hailctl	7	1685	January 8, 2019
Stuck at writing Hail Table Hail Query & hailctl	11	501	February 6, 2023
Calculating the number of variants per column category Hail Query & hailctl	10	658	November 17, 2020

Is there a way to get a list of non-reference heterozygous and homozygous variants by gene for each sample from a MatrixTable?

Related topics