Manhattan plot does not match p-value summary statistics

ch283 · January 23, 2020, 1:47pm

Hello, I did a GWAS with hail and created the above manhattan plot for associations with BMI on chromosome 16. As you can see the p-values look a lot smaller than one would expect. To try to understand what was going on I printed some summary statistics for the p-values using hl.agg.stats:

{'max': 0.999999311371031,
'mean': 0.3411181578621658,
'min': 2.92499235601123e-309,
'n': 2915824,
'stdev': 0.32161619572584055,
'sum': 994640.5115302918}

But these don’t appear to match the graph – for this mean and standard deviation I would expect to see most values below the significance threshold.

Any idea what might be going on here?
Thanks for your help.

tpoterba · January 23, 2020, 1:51pm

Can you share the full python pipeline you ran?

I think most values are indeed below the significance threshold, but they’re overlaid on each other (within the same pixel, basically). I think a QQ plot would be quite illuminating here.

ch283 · January 23, 2020, 6:20pm

I think most values are indeed below the significance threshold, but they’re overlaid on each other (within the same pixel, basically).

I hadn’t thought of this, but it would make sense. Below is the q-q plot which looks similarly weird:

And here is the pipeline:

import hail as hl
hl.init(default_reference='GRCh37')  

# import chr16 bgen as a MatrixTable
ef = ['GT', 'GP','dosage']
hl.index_bgen("ukb_imp_chr16_v3.bgen",contig_recoding={"01": "1"})
chr1 = hl.import_bgen("ukb_imp_chr16_v3.bgen", entry_fields=ef,sample_file="ukb52480_imp_chr16_v3_s487314.sample")

# read in phenotypes and participant ids to pandas dataframe
import pandas as pd
phenos = pd.read_csv("ukb37493(2).csv", usecols = ["eid","21001-0.0"],dtype={"eid":str,"21001-0.0":float})
# then convert to Hail table object
phenos.columns = ["eid","BMI"]
ph = hl.Table.from_pandas(phenos,key=["eid"])

# remove individuals with NaN BMI values
ph_filtered = ph.filter(hl.is_nan(ph.BMI),keep=False)
pprint(ph.count()-ph_filtered.count())

# join phenotype table onto matrixtable
chr16 = chr16.annotate_cols(phenotype = ph_filtered[chr16.s])

# gwas
gwas = hl.linear_regression_rows(y=chr16.phenotype.BMI,
                             x=chr16.GT.n_alt_alleles(),
                             covariates=[1.0])
gwas.describe()

# manhattan plot
m = hl.plot.manhattan(gwas.p_value)
show(m)

# q-q plot
p = hl.plot.qq(gwas.p_value)
show(p)

Do you think the oversignificance could be something to do with doing a GWAS on only one chromosome? I only did this to practice using hail’s GWAS functionality and realise it is not the usual approach.

kumarveerapen · January 23, 2020, 6:28pm

It looks to me like your QQ plot is heavily inflated possibly implying that you did not control for confounders. That is likely why you have many significant peaks in your initial Manhattan plot.

I would highly suggest adjusting your model to include possible confounders e.g. PCs and possibly viewing our GWAS tutorial https://hail.is/docs/0.2/tutorials/01-genome-wide-association-study.html

Topic		Replies	Views
QQ plot downsampling Hail Query & hailctl	5	499	February 7, 2020
Errors in manhattan plot Hail Query & hailctl	6	435	October 29, 2020
Manhattan plot not showing up Hail Query & hailctl	5	24	April 15, 2025
Qq_plot error FatalError: ClassNotFoundException: is.hail.expr.ir.agg.ApproxCDFCombiner Hail Query & hailctl	5	553	January 19, 2020
Lambda calculation & qq plot for hail Hail Query & hailctl	1	70	October 4, 2024

Manhattan plot does not match p-value summary statistics

Related topics