How to plot a value vs variant location

Hi folks—

I’m starting with a matrix table read in from VCF files from the UK Biobank, and I’d like to plot some values (e.g. blood pressure) on the y-axis and the location of the variant on the x-axis. However, I’m not sure how to convert the LocusExpression type to either a NumericExpression for the x data (as required by hail.plot.scatter). Example code:

p = hl.plot.scatter(x=mt.locus.position, 
                    y=mt.sample_data.p4080_i0,
                    xlabel='position',
                    ylabel='p4080_i0',
                    hover_fields={'ID': mt.s},
                    size=8)
show(p)

As it is, I get this ExpressionException error:

2022-09-29 07:20:08 Hail: ERROR: scope violation: ‘MatrixTable.aggregate_rows’ supports aggregation over indices [‘row’]
Found indices [‘row’, ‘column’], with unexpected indices [‘column’]. Invalid fields:
‘sample_data’ (indices [‘column’])
‘s’ (indices [‘column’])

I’m also not sure if the hover_fields line is working, either. I think it might also throw an error since mt.s is a str, and I think hover_fields requires an Expression.

Underneath all this for me I think is a lack of understanding of how (and when) to convert Expressions from one type to another.

Best,
Jeremy

This is definitely a bad error message, but I think the core issue is that the axes of mt.locus.position and mt.sample_data are different – the phenotypes in mt.sample_data, column fields of the matrix table, don’t vary by row.

Could you describe in more detail the plot you want to make, using this formula:

For each _____, plot ___ as the X coordinate and ____ as the Y coordinate.

As an example, I’d describe a manhattan plot as:

For each variant, plot the global position as the X coordinate and the -log10 pValue as the Y coordinate.

Thank you Tim! Your formula was useful for helping me troubleshoot—it looks like I’ll need to do statistics on sample values over samples to make the plot:

For each variant, plot the locus position within chromosome (or global position, doesn’t matter—just some sense of where the variant sits within the gene) as the X coordinate and the median value (ideally with error bars) of p4080_i0 across all samples as the Y coordinate.

I will try to figure out how to do this…

Best,
Jeremy

Hi Tim et al—

Hope you are all doing well. I am just returning to this, and I’m wondering if you can help—your exercise made me realize that I didn’t know exactly what I wanted to plot. I think what I want is the following:

For each variant, plot the locus position within chromosome (or global position, doesn’t matter—just some sense of where the variant sits within the gene) as the X coordinate and the median value (ideally with error bars) of p4079_i0 across samples with 1 or 2 alternate alleles as the Y coordinate.

My difficulty at the moment is that I’m not sure how to filter samples for ones who have alternate alleles.

As a follow-up to this, I’d like also to make an allelic dose-response plot as well: in this case, for each variant I’d want a separate plot with a box-and-whisker corresponding to 0, 1, or 2 alternate alleles (so 3 box-and-whiskers total), with p4079_i0 on the y axis.

Any suggestions would be very helpful!

Best,
Jeremy

Is p4079_i0 an “entry indexed” field? You can check this using mt.describe(). If so, I think you want this:

mt = mt.annotate_rows(
   median_p4079_i0 = hl.agg.filter(mt.GT.n_alt_alleles() > 0, hl.agg.approx_median(mt.p4079_i0))
)

There are a bunch of aggregators you can use: Hail | Aggregators . In terms of plotting, you might use our new ggplot library:

tooltip = hl.str(gwas.locus) + "; " + hl.str(gwas.stats_p4079_i0)

fig = (
    ggplot(gwas, aes(x=gwas.locus, tooltip=tooltip)) + 
    geom_point(aes(y=gwas. stats_p4079_i0))
)
fig.show()

Creating a separate plot per-variant is not something we can easily do massively in parallel. All our plotting relies on local Python operations. You can generate the data for such a plot though:

mt = mt.annotate_rows(
   median_p4079_i0 = hl.agg.filter(mt.GT.n_alt_alleles() > 0, hl.agg.approx_quantiles(mt.p4079_i0, [0, 0.25, 0.5, 0.75, 1])),
   stats_p4079_i0 = hl.agg.filter(mt.GT.n_alt_alleles() > 0, hl.agg.stats(mt.p4079_i0))

)

That first gives you the approximate quantiles which you can use for your box and whisker. The second gives you exact values for mean, standard deviation, and other things which you might find useful.