How to generate Manhattan plot for gene(burden) hail matrix table after regression analysis?

I have a gene(burden) hail matrix table where I condense all variants into genes for easier computation.

mt = mt.annotate_entries(
                mt.variant_qc.AF[1] <= 0.5,
                2 - mt.GT.n_alt_alleles()
gene_mt = mt.group_rows_by(mt.symbol, mt.feature).aggregate(
            mac_HC=hl.agg.sum(hl.if_else(mt.lof == 'HC', mt.mac_adj, 0))

However, group_rows_by doesn’t have option of annotating with locus attribute.
Later on, when I do regression analysis on matrix table and get the result, I am unable to plot Manhattan plot as it requires locus attribute.

AttributeError: Table instance has no field, method, or property ‘locus’
Hint: use ‘describe()’ to show the names of all data fields.

How do you get Manhattan plot for gene matrix table based analysis?

PS: QQ plot is working fine

Hi, @Abhishek . Usually with gene burden tests, we describe our results with a table than with a Manhattan plot. If you do want to create a Manhattan plot, you could perhaps use chr#:start-position for each gene for visualization.

Hi Veerapen,

Yes, I am generating a table separately as well for the results.

Can you provide me the hail code to generate chr#:start-position for each gene as a column, so I can annotate it in my final hail table(results) or gene hail matrix table?

I came up with this(which does not seem to be the optimized version but it works)

var_t = mt.rows()
var_t = var_t.group_by(vep_symbol=var_t.symbol, vep_feature=var_t.feature).aggregate(
            gene_contig=hl.agg.take(, 1)[0],
            gene_ref=hl.agg.take(, 1)[0]

gene_mt = mt.group_rows_by(mt.symbol, mt.feature).aggregate(

gene_mt = gene_mt.annotate_rows(
  [gene_mt.row_key].gene_contig, var_t[gene_mt.row_key].gene_pos,

I think your last approach works, but I’ll say a bit more info about why a manhattan plot isn’t really appropriate here. We plot GWAS results using genomic coordinate as the x-axis and see “skyscrapers” in the results because linkage disequilibrium results in correlated genotypes and therefore correlated p-values in nearby sites. If you plot results per gene in the same way (using start locus or end locus or whatever), you won’t see any of the same structure because the results for nearby genes should not be correlated.

1 Like

What will be a better plot for gene matrix table downstream analysis after regression?

Perhaps a volcano plot where the x-axis would be the effect size (log odds ratio) and y-axis being the -log10(p-value). Was inspired by a presentation by @Siwei this AM who used a similar visualization to present her gene-based test results today.

It sounds like a good plot. Do you have a reference image or link to the presentation by @Siwei ?
I would like to run through by my team.