Manhattan plot does not match p-value summary statistics

Hello, I did a GWAS with hail and created the above manhattan plot for associations with BMI on chromosome 16. As you can see the p-values look a lot smaller than one would expect. To try to understand what was going on I printed some summary statistics for the p-values using hl.agg.stats:

{'max': 0.999999311371031,
'mean': 0.3411181578621658,
'min': 2.92499235601123e-309,
'n': 2915824,
'stdev': 0.32161619572584055,
'sum': 994640.5115302918}

But these don’t appear to match the graph – for this mean and standard deviation I would expect to see most values below the significance threshold.

Any idea what might be going on here?
Thanks for your help.

Can you share the full python pipeline you ran?

I think most values are indeed below the significance threshold, but they’re overlaid on each other (within the same pixel, basically). I think a QQ plot would be quite illuminating here.

I think most values are indeed below the significance threshold, but they’re overlaid on each other (within the same pixel, basically).

I hadn’t thought of this, but it would make sense. Below is the q-q plot which looks similarly weird:

And here is the pipeline:

import hail as hl
hl.init(default_reference='GRCh37')  

# import chr16 bgen as a MatrixTable
ef = ['GT', 'GP','dosage']
hl.index_bgen("ukb_imp_chr16_v3.bgen",contig_recoding={"01": "1"})
chr1 = hl.import_bgen("ukb_imp_chr16_v3.bgen", entry_fields=ef,sample_file="ukb52480_imp_chr16_v3_s487314.sample")

# read in phenotypes and participant ids to pandas dataframe
import pandas as pd
phenos = pd.read_csv("ukb37493(2).csv", usecols = ["eid","21001-0.0"],dtype={"eid":str,"21001-0.0":float})
# then convert to Hail table object
phenos.columns = ["eid","BMI"]
ph = hl.Table.from_pandas(phenos,key=["eid"])
# remove individuals with NaN BMI values
ph_filtered = ph.filter(hl.is_nan(ph.BMI),keep=False)
pprint(ph.count()-ph_filtered.count())
# join phenotype table onto matrixtable
chr16 = chr16.annotate_cols(phenotype = ph_filtered[chr16.s])

# gwas
gwas = hl.linear_regression_rows(y=chr16.phenotype.BMI,
                             x=chr16.GT.n_alt_alleles(),
                             covariates=[1.0])
gwas.describe()

# manhattan plot
m = hl.plot.manhattan(gwas.p_value)
show(m)

# q-q plot
p = hl.plot.qq(gwas.p_value)
show(p)

Do you think the oversignificance could be something to do with doing a GWAS on only one chromosome? I only did this to practice using hail’s GWAS functionality and realise it is not the usual approach.

It looks to me like your QQ plot is heavily inflated possibly implying that you did not control for confounders. That is likely why you have many significant peaks in your initial Manhattan plot.

I would highly suggest adjusting your model to include possible confounders e.g. PCs and possibly viewing our GWAS tutorial :slight_smile: https://hail.is/docs/0.2/tutorials/01-genome-wide-association-study.html