I have a gene(burden) hail matrix table where I condense all variants into genes for easier computation.
mt = mt.annotate_entries(
mac_adj=hl.if_else(
mt.variant_qc.AF[1] <= 0.5,
mt.GT.n_alt_alleles(),
2 - mt.GT.n_alt_alleles()
)
)
gene_mt = mt.group_rows_by(mt.symbol, mt.feature).aggregate(
mac=hl.agg.sum(mt.mac_adj),
mac_HC=hl.agg.sum(hl.if_else(mt.lof == 'HC', mt.mac_adj, 0))
However, group_rows_by
doesnât have option of annotating with locus
attribute.
Later on, when I do regression analysis on matrix table and get the result, I am unable to plot Manhattan plot as it requires locus attribute.
AttributeError: Table instance has no field, method, or property âlocusâ
Hint: use âdescribe()â to show the names of all data fields.
How do you get Manhattan plot for gene matrix table based analysis?
PS: QQ plot is working fine
Hi, @Abhishek . Usually with gene burden tests, we describe our results with a table than with a Manhattan plot. If you do want to create a Manhattan plot, you could perhaps use chr#:start-position
for each gene for visualization.
Hi Veerapen,
Yes, I am generating a table separately as well for the results.
Can you provide me the hail code to generate chr#:start-position
for each gene as a column, so I can annotate it in my final hail table(results) or gene hail matrix table?
I came up with this(which does not seem to be the optimized version but it works)
var_t = mt.rows()
var_t = var_t.group_by(vep_symbol=var_t.symbol, vep_feature=var_t.feature).aggregate(
num_variants=hl.agg.count(),
gene_contig=hl.agg.take(var_t.locus.contig, 1)[0],
gene_pos=hl.agg.min(var_t.locus.position),
gene_ref=hl.agg.take(var_t.locus.dtype.reference_genome.name, 1)[0]
)
gene_mt = mt.group_rows_by(mt.symbol, mt.feature).aggregate(
mac=hl.agg.sum(mt.mac_adj)
)
gene_mt = gene_mt.annotate_rows(
locus=hl.locus(var_t[gene_mt.row_key].gene_contig, var_t[gene_mt.row_key].gene_pos,
var_t[gene_mt.row_key].gene_ref.take(1)[0])
)
I think your last approach works, but Iâll say a bit more info about why a manhattan plot isnât really appropriate here. We plot GWAS results using genomic coordinate as the x-axis and see âskyscrapersâ in the results because linkage disequilibrium results in correlated genotypes and therefore correlated p-values in nearby sites. If you plot results per gene in the same way (using start locus or end locus or whatever), you wonât see any of the same structure because the results for nearby genes should not be correlated.
1 Like
What will be a better plot for gene matrix table downstream analysis after regression?
Perhaps a volcano plot where the x-axis would be the effect size (log odds ratio) and y-axis being the -log10(p-value). Was inspired by a presentation by @Siwei this AM who used a similar visualization to present her gene-based test results today.
It sounds like a good plot. Do you have a reference image or link to the presentation by @Siwei ?
I would like to run through by my team.