I’m pretty new to working with Hail, so I’m a little confused about how to work with the MatrixTables. I would appreciate any advice!
I have a MatrixTable (mt
) that looks like this:
----------------------------------------
Global fields:
None
----------------------------------------
Column fields:
's': str
'ancestry_pred': str
----------------------------------------
Row fields:
'locus': locus<GRCh38>
'alleles': array<str>
'filters': set<str>
'a_index': int32
'was_split': bool
'variant_qc': struct {
gq_stats: struct {
mean: float64,
stdev: float64,
min: float64,
max: float64
},
call_rate: float64,
n_called: int64,
n_not_called: int64,
n_filtered: int64,
n_het: int64,
n_non_ref: int64,
het_freq_hwe: float64,
p_value_hwe: float64,
p_value_excess_het: float64
}
'info': struct {
AC: array<int32>,
AF: array<float64>,
AN: int32,
homozygote_count: array<int32>
}
'gvs_afr_af': float64
'gvs_amr_af': float64
'gvs_eas_af': float64
'gvs_eur_af': float64
'gvs_mid_af': float64
'gvs_oth_af': float64
'gvs_sas_af': float64
'gene_id': str
'gene_symbol': str
'consequence': str
'omim_phenotypes_name': str
'clinvar_classification': str
'clinvar_phenotype': str
'vid': str
----------------------------------------
Entry fields:
'GT': call
'GQ': int32
'RGQ': int32
'FT': str
'AD': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------
I want to get a table that looks like this, where we can get a list of non-reference heterozygous and homozygous variants (vid
) by gene (gene_symbol
) for each sample (s
):
s | gene_symbol | vid | consequence |
---|---|---|---|
1 | TESK2 | [‘1-3414321-G-A’, ‘1-3414321-T-A’] | [‘missense_variant’, ‘missense_variant’] |
1 | PEX26 | [‘22-18561317-T-A’, ‘22-18561317-TG-A’] | [‘missense_variant’, ‘frameshift_variant’] |
2 | TESK2 | [‘1-3414321-G-A’] | [‘missense_variant’] |
3 | PEX26 | [‘22-18561317-C-A’] | [‘missense_variant’] |
… | … | … | … |
where vid
is the locus.contig
- locus.position
- ref_allele
- alt_allele
.
I tried to get this table by doing the following, but when I show the table, it is empty:
# filter for het_non_ref and hom_var
mt = mt.filter_entries(mt.GT.is_het_non_ref() | mt.GT.is_hom_var())
# get entries
pv_table = mt.entries()
# select columns
candidate_genes = pv_table.select(pv_table.s,
pv_table.ancestry_pred,
pv_table.locus,
pv_table.alleles,
pv_table.vid,
pv_table.gene_symbol,
pv_table.gene_id,
pv_table.consequence,
)
# group by gene symbol and sample id
candidate_genes = candidate_genes.group_by("s", "gene_symbol").aggregate(
vid = hl.agg.collect(
candidate_genes.vid
),
consequence = hl.agg.collect(
candidate_genes.consequence
)
)
candidate_genes.show()
I thought about using make_table
but I have over 1000 samples.
What is the best strategy to get a table like the one above?
Thanks!