P-values differ from R for linear regression

It was a big surprise to get different p-values from Hail and from R. The input data, which I checked manually, is identical. The Hail code is as follows:

mt = mt.annotate_rows(gwas=hl.agg.linreg(
    hl.float(mt.phenos.height),
    [1,
     hl.float(mt.phenos.age),
     hl.float(mt.phenos.gender),
     hl.float(mt.anc0dos.x),
     hl.float(mt.anc1dos.x),
     hl.float(mt.anc2dos.x),
     hl.float(mt.hapcounts0.x),
     hl.float(mt.hapcounts1.x)]))

The R code is as follows:

summary(glm(HEIGHT ~ AGE + GENDER + ANC0DOS + ANC1DOS + ANC2DOS + HC0 + HC1))

For one row, the p-value from Hail is 0.0995 (beta = -2.19, se = 1.23, t = -1.79), while the p-value from R is 0.086 (beta = -2.19150, se = 1.17954, t = -1.858). I also ran Python's statsmodels.api.OLS with a constant and got the same p-value as R (0.086).
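For reference, a minimal sketch of that statsmodels check (the file names and loading step are placeholders, not the actual pipeline):

import numpy as np
import statsmodels.api as sm

# Placeholder inputs: y holds HEIGHT for the 20 samples, X holds the columns
# AGE, GENDER, ANC0DOS, ANC1DOS, ANC2DOS, HC0, HC1 in the same sample order.
y = np.loadtxt("height.txt")
X = np.loadtxt("covariates.txt")

X = sm.add_constant(X)    # add the intercept column ("with constant")
fit = sm.OLS(y, X).fit()
print(fit.summary())      # betas, standard errors, t statistics, p-values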

The fact that the beta values agree suggests there is no difference in the input data. I am not sure why the standard errors differ.

Thanks for any help.

Can you try using linear_regression_rows and seeing what you get? Docs here: Hail | Statistics

It will help us understand what’s going on here. It may be that our linreg aggregator has some numerical instability that we need to address.
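A minimal sketch of that call, assuming the field layout from the original post (height, age, and gender as column fields; anc0dos as an entry field, used here as the x variable only for illustration). Note that the covariates must be column-indexed:

import hail as hl

# mt is the MatrixTable from the original post.
result = hl.linear_regression_rows(
    y=hl.float(mt.phenos.height),
    x=hl.float(mt.anc0dos.x),                 # entry-indexed input variable
    covariates=[1.0,                          # intercept
                hl.float(mt.phenos.age),      # column-indexed
                hl.float(mt.phenos.gender)])  # column-indexed
result.show()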

I cannot make linear_regression_rows work; it fails with this error:

scope violation: 'linear_regression_rows/covariates' expects an expression indexed by ['column']
Found indices ['row', 'column'], with unexpected indices ['row']. Invalid fields:
    'anc1dos' (indices ['row', 'column'])

I forgot to say: the discrepancy occurs when I use 20 samples. If I use 25,000 samples, the p-values are identical up to floating-point precision.

You can’t have entrywise covariates with linear_regression_rows. Just include them as x values.

Could you elaborate a little more on “entrywise”? The only difference between the 20-sample and 25,000-sample runs is the sample size; all the data structures are the same.
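A rough sketch of the distinction, assuming the schema implied by the original post:

# mt.phenos.age, mt.phenos.gender  - column fields: one value per sample,
#                                    so they are valid linear_regression_rows covariates.
# mt.anc0dos.x, mt.hapcounts0.x    - entry fields: one value per (variant, sample) pair,
#                                    i.e. "entrywise"; these are what the scope violation
#                                    above rejects from `covariates`.
# The hl.agg.linreg aggregator, by contrast, accepts entry-indexed expressions
# in its x list, which is why the original annotate_rows call runs:
mt = mt.annotate_rows(gwas=hl.agg.linreg(
    hl.float(mt.phenos.height),
    [1,
     hl.float(mt.phenos.age),      # column-indexed covariate
     hl.float(mt.anc0dos.x)]))     # entry-indexed ("entrywise") covariate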