P-values differ from R for linear regression

mocksu · March 16, 2021, 2:28pm

I was surprised to get different p-values from Hail and from R. The input data, which I checked manually, is identical. The Hail code is as follows:

    mt = mt.annotate_rows(gwas=hl.agg.linreg(
        hl.float(mt.phenos.height),
        [1,
         hl.float(mt.phenos.age),
         hl.float(mt.phenos.gender),
         hl.float(mt.anc0dos.x),
         hl.float(mt.anc1dos.x),
         hl.float(mt.anc2dos.x),
         hl.float(mt.hapcounts0.x),
         hl.float(mt.hapcounts1.x)]))

The R code is as follows:

    summary(glm(HEIGHT ~ AGE + GENDER + ANC0DOS + ANC1DOS + ANC2DOS + HC0 + HC1))

For one row, the p-value from Hail is 0.0995 (beta = -2.19, se = 1.23, t = -1.79), while the p-value from R is 0.086 (beta = -2.19150, se = 1.17954, t = -1.858). Python's statsmodels.api.OLS, with a constant term added, gives the same p-value as R (0.086).

The beta values agree, which suggests the input data are the same. I am not sure why the standard errors differ.

Thanks for any help.
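
The figures quoted in the post above can be cross-checked by hand: the t statistic is beta / se, and the two-sided p-value follows from Student's t distribution with the residual degrees of freedom. The sketch below does this in plain Python. The candidate degrees of freedom are an assumption: 12 corresponds to 20 samples minus 8 regressors (intercept plus 7 covariates), but the thread states neither how many samples each fit actually used nor the rank of the design matrix. The helper names `t_pdf` and `two_sided_p` are made up for this illustration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value P(|T| >= |t|), using Simpson's rule on [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3  # P(-|t| <= T <= |t|), by symmetry
    return 1 - central_mass

# Estimates quoted in the post (beta / se).
hail_t = -2.19 / 1.23
r_t = -2.19150 / 1.17954

# ASSUMPTION: candidate residual degrees of freedom; the thread does not state them.
for df in (11, 12, 13):
    print(df, round(two_sided_p(hail_t, df), 4), round(two_sided_p(r_t, df), 4))
```

Comparing the printed values with the reported 0.0995 and 0.086 shows which degrees of freedom each tool is consistent with. Because the residual variance is the residual sum of squares divided by the degrees of freedom, a mismatch there would change the standard error while leaving beta untouched, and its effect would shrink as the sample size grows.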

johnc1231 · March 17, 2021, 3:14pm

Can you try using linear_regression_rows and seeing what you get? Docs here: Hail | Statistics

johnc1231 · March 18, 2021, 5:13pm

It will help us understand what’s going on here. It may be that our linreg aggregator has some numerical instability that we need to address.

mocksu · March 18, 2021, 7:35pm

I cannot get `linear_regression_rows` to work. It fails with this error:

    scope violation: 'linear_regression_rows/covariates' expects an expression indexed by ['column']
    Found indices ['row', 'column'], with unexpected indices ['row']. Invalid fields:
    'anc1dos' (indices ['row', 'column'])

mocksu · March 18, 2021, 7:36pm

I forgot to say that the discrepancy occurs when I use 20 samples. With 25,000 samples, the p-values are identical apart from differences in numerical precision.

johnc1231 · March 26, 2021, 7:20pm

You can’t have entrywise covariates with linear_regression_rows. Just include them as x values.

mocksu · March 31, 2021, 12:49am

Could you elaborate a little more on “entry wise”? The only difference between the 20-sample and 25k-sample runs is the sample size; all data structures are the same.
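
One way to read "entry-wise", going by the error message earlier in the thread: a field indexed by ['column'] has one value per sample and is shared by every row, while a field indexed by ['row', 'column'] has a separate value for each row and sample pair. The sketch below illustrates the distinction with plain Python lists rather than Hail; the toy data and the helper `design_matrix` are hypothetical.

```python
# Hypothetical toy data: 3 rows (variants) x 4 columns (samples).
age = [31.0, 45.0, 52.0, 28.0]  # column-indexed: one value per sample

anc1dos = [                      # entry-indexed: one value per (row, sample)
    [0.0, 1.0, 2.0, 1.0],
    [1.0, 1.0, 0.0, 2.0],
    [2.0, 0.0, 1.0, 0.0],
]

def design_matrix(row, include_entry_field):
    """Build the regressors for one row: intercept, age, optionally anc1dos."""
    matrix = []
    for sample in range(len(age)):
        regressors = [1.0, age[sample]]
        if include_entry_field:
            regressors.append(anc1dos[row][sample])
        matrix.append(regressors)
    return matrix

# With column-indexed covariates only, every row shares one design matrix.
print(design_matrix(0, False) == design_matrix(2, False))

# With an entry-indexed field, the design matrix changes from row to row.
print(design_matrix(0, True) == design_matrix(1, True))
```

Under this reading, the restriction concerns how a field is indexed, not how many samples there are, so it would apply equally to the 20-sample and 25k-sample runs.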
