Results from linear regression are array instead of float

Hello - I am following using the methods suggested to me in this thread.

Briefly, I have 79800 gwases to run on ~11.5k samples. For efficiency, @tpoterba suggested that I run the gwas in batches. Just to make sure I know what’s happening and understand the output, I’m testing the batch suggestion, but only for a batch size of 1. The gwas results are confusing, though. For each row (eg, each rsid), metrics like beta, t_stat, and p_value are arrays instead of a single float value. It makes sense that they would be an array of one, since the subset method enters an array as the x side of the equation. So that when I expand the batches to larger sizes, the size arrays in the results would increase as well.

However, for the gwas using an array of 1, instead of having an array containing 1 p_value per rsid, the results are much larger:

gwas.p_value.show()

locus alleles p_value
locus<GRCh37>	array<str>	array<float64>
1:57222	["T","C"]	[9.16e-01,3.23e-01,9.61e-01,6.96e-02,7.45e-01,2.25e-01,8.18e-01,6.95e-01,8.49e-01,7.77e-01,5.69e-01,6.39e-01,5.56e-01,3.00e-01,3.79e-01,3.67e-01,7.50e-01,9.23e-01,6.73e-01,3.68e-01,5.24e-01,9.56e-01,7.51e-01,9.52e-01,7.38e-01,7.92e-01,4.32e-01,3.54e-01,1.86e-01,8.24e-01,4.98e-01,9.96e-01]
1:60433	["A","G"]	[3.95e-01,8.55e-01,4.19e-01,8.79e-01,4.42e-01,5.34e-01,9.38e-01,9.58e-01,8.13e-01,2.02e-01,8.70e-02,1.28e-01,6.13e-01,4.38e-01,3.25e-01,9.40e-01,7.38e-01,1.69e-01,5.13e-01,6.35e-01,9.72e-01,5.15e-01,8.16e-01,9.57e-01,6.89e-01,7.80e-01,7.98e-01,4.36e-01,8.26e-01,3.92e-01,1.46e-01,4.10e-01]
1:63336	["C","T"]	[6.86e-03,5.18e-01,7.90e-01,2.06e-01,9.57e-01,1.83e-02,1.10e-01,3.83e-01,5.97e-01,8.65e-01,7.18e-01,4.94e-01,9.17e-01,7.28e-01,6.74e-01,4.81e-01,8.48e-01,3.96e-01,1.95e-01,5.45e-01,5.45e-01,5.85e-01,8.27e-01,5.65e-01,8.05e-01,3.54e-01,7.87e-01,3.99e-01,7.90e-01,8.11e-01,7.77e-01,7.53e-01]
1:64795	["A","G"]	[4.99e-01,3.70e-01,6.01e-01,6.84e-01,2.80e-01,9.85e-01,1.63e-01,4.68e-01,2.57e-01,1.99e-01,1.68e-02,1.49e-01,1.04e-01,2.90e-01,7.12e-02,9.15e-01,3.78e-01,8.55e-01,6.45e-01,4.06e-01,6.85e-01,3.73e-01,4.19e-01,4.27e-01,4.85e-01,9.05e-01,1.81e-01,6.36e-01,5.42e-03,1.17e-02,2.44e-02,6.38e-03]
1:76897	["G","A"]	[6.84e-01,9.51e-01,7.86e-01,9.18e-01,4.77e-01,9.33e-01,8.41e-01,9.86e-01,9.81e-01,4.30e-01,9.00e-01,8.21e-01,6.45e-01,5.33e-01,8.64e-01,6.66e-01,9.45e-01,9.83e-01,8.95e-01,6.13e-01,8.00e-01,8.60e-01,6.95e-01,5.17e-01,7.58e-01,7.07e-01,9.62e-01,9.11e-01,7.05e-01,9.80e-01,7.61e-01,7.71e-01]
1:77874	["G","A"]	[7.49e-01,3.76e-01,4.87e-01,8.39e-01,6.69e-01,7.06e-01,9.07e-01,5.75e-01,5.09e-01,6.23e-01,7.95e-01,4.60e-01,3.82e-01,4.70e-01,5.28e-01,3.52e-01,1.04e-01,7.42e-01,2.46e-01,8.70e-01,5.84e-01,5.00e-01,6.42e-01,5.77e-01,5.74e-01,9.84e-01,6.26e-01,4.79e-01,9.22e-01,8.09e-01,7.74e-01,7.90e-01]
1:82036	["A","T"]	[9.96e-01,5.29e-01,8.02e-02,4.27e-01,5.81e-01,3.29e-01,4.27e-01,3.76e-01,7.48e-01,7.44e-01,4.84e-01,9.67e-01,8.68e-01,2.44e-01,1.57e-01,6.57e-01,8.45e-01,3.74e-01,9.18e-01,1.20e-01,1.26e-01,6.82e-01,9.82e-01,6.96e-01,3.02e-01,6.63e-01,1.89e-01,1.50e-01,6.35e-01,8.79e-01,4.25e-01,3.15e-01]
1:84156	["A","C"]	[5.12e-01,4.30e-01,7.06e-01,6.67e-01,7.12e-01,6.01e-01,8.78e-01,9.07e-01,3.83e-01,3.43e-01,2.12e-01,5.10e-01,5.31e-01,8.36e-01,7.93e-01,3.02e-01,5.48e-01,7.42e-01,7.59e-01,7.30e-01,2.50e-01,9.14e-01,5.08e-01,4.62e-01,5.82e-01,1.37e-01,4.75e-01,6.54e-01,1.12e-01,4.17e-01,9.72e-01,5.91e-01]
1:85107	["A","G"]	[2.13e-01,1.08e-01,2.36e-01,6.61e-01,5.95e-01,8.04e-01,7.84e-01,3.97e-01,2.83e-01,3.15e-02,1.22e-01,9.26e-01,9.63e-01,3.83e-01,6.57e-01,6.66e-02,4.54e-01,5.11e-01,3.96e-01,4.96e-01,2.06e-01,5.76e-01,7.56e-01,7.55e-01,5.18e-01,4.35e-02,3.34e-01,8.56e-01,9.66e-02,7.02e-01,7.92e-01,3.56e-01]
1:85941	["C","T"]	[3.65e-01,4.20e-01,9.31e-01,5.15e-01,8.29e-01,5.80e-01,5.15e-01,1.71e-01,6.57e-02,1.30e-01,4.23e-01,7.52e-01,9.14e-01,9.60e-01,4.93e-01,1.27e-01,6.44e-02,5.14e-02,5.75e-02,5.78e-01,2.23e-02,6.29e-02,1.38e-01,4.09e-01,2.97e-01,8.42e-01,4.43e-01,5.48e-01,1.97e-02,2.58e-01,2.36e-01,8.16e-01]

Here is the code I used to generate these results:

batch_size = 1
batch_index = 0

# read in the pheno matrix table
phenos_mt = hl.read_matrix_table('idps_i_deconf.mt')
phenos_mt = phenos_mt.add_col_index()

# find the start and stop columns of the idps
pheno_start = batch_size * batch_index
pheno_end = pheno_start + batch_size

# filter the idps of interest
phenos_mt = phenos_mt.filter_cols((phenos_mt.col_idx >= pheno_start) & (phenos_mt.col_idx < pheno_end))
phenos_table = phenos_mt.annotate_rows(phenos = hl.agg.collect(phenos_mt.x)).rows()

# merge the idps with the genetic data
geno_subset_mt = geno_subset_mt.annotate_cols(phenos = phenos_table[geno_subset_mt.s].phenos)

# run the gwas
gwas = hl.linear_regression_rows(
    [geno_subset_mt.phenos[i] for i in range(batch_size)],
    geno_subset_mt.dosage,
    covariates=[1.0],
    block_size=128)

Thanks in advance!

Hmm. The following works for me. Does anything stand out to use as particularly different between the fake matrix tables I used and your matrix tables? What does geno_subset_mt.describe() print just before the regression? What version of Hail are you using?

In [14]: batch_size = 1 
    ...: batch_index = 0 
    ...:  
    ...: # read in the pheno matrix table 
    ...: phenos_mt = hl.utils.range_matrix_table(100, 10) 
    ...: phenos_mt = phenos_mt.annotate_entries(x=phenos_mt.row_idx * phenos_mt.col_idx) 
    ...:  
    ...: # find the start and stop columns of the idps 
    ...: pheno_start = batch_size * batch_index 
    ...: pheno_end = pheno_start + batch_size 
    ...:  
    ...: # filter the idps of interest 
    ...: phenos_mt = phenos_mt.filter_cols((phenos_mt.col_idx >= pheno_start) & (phenos_mt.col_idx < pheno_end)) 
    ...: phenos_table = phenos_mt.annotate_rows(phenos = hl.agg.collect(phenos_mt.x)).rows() 
    ...:  
    ...: # merge the idps with the genetic data 
    ...: geno_subset_mt = hl.balding_nichols_model(1, 10, 10) 
    ...: geno_subset_mt = geno_subset_mt.add_col_index('s') 
    ...: geno_subset_mt = geno_subset_mt.key_cols_by(s=hl.int32(geno_subset_mt.s)) 
    ...: geno_subset_mt = geno_subset_mt.annotate_cols(phenos = phenos_table[geno_subset_mt.s].phenos) 
    ...:  
    ...: # run the gwas 
    ...: gwas = hl.linear_regression_rows( 
    ...:     [geno_subset_mt.phenos[i] for i in range(batch_size)], 
    ...:     geno_subset_mt.GT.n_alt_alleles(), 
    ...:     covariates=[1.0], 
    ...:     block_size=128) 
    ...: hl.len(gwas.p_value).show()                                                                                                           
2021-10-29 12:21:31 Hail: INFO: balding_nichols_model: generating genotypes for 1 populations, 10 samples, and 10 variants...
2021-10-29 12:21:31 Hail: INFO: Coerced sorted dataset
2021-10-29 12:21:31 Hail: INFO: linear_regression_rows: running on 10 samples for 1 response variable y,
    with input variable x, and 1 additional covariate...
+---------------+------------+--------+
| locus         | alleles    | <expr> |
+---------------+------------+--------+
| locus<GRCh37> | array<str> |  int32 |
+---------------+------------+--------+
| 1:1           | ["A","C"]  |      1 |
| 1:2           | ["A","C"]  |      1 |
| 1:3           | ["A","C"]  |      1 |
| 1:4           | ["A","C"]  |      1 |
| 1:5           | ["A","C"]  |      1 |
| 1:6           | ["A","C"]  |      1 |
| 1:7           | ["A","C"]  |      1 |
| 1:8           | ["A","C"]  |      1 |
| 1:9           | ["A","C"]  |      1 |
| 1:10          | ["A","C"]  |      1 |
+---------------+------------+--------+

I’m using Hail 0.2.64. Here is the output of geno_subset_mt.describe()

Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'phenos': array<float64>
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'varid': str
    'mfi': struct {
        bp: str, 
        info: float64
    }
----------------------------------------
Entry fields:
    'dosage': float64
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

Hmm. I really have no explanation for that. I installed Hail 0.2.64 and ran the aforementioned test-case and got all length 1 p-values.

The code pretty clearly will only produce more than one p-value if there is more than one y-value. Maybe try changing your code to this to verify everything is the length we expect.

y_values = [geno_subset_mt.phenos[i] for i in range(batch_size)]
assert len(y_values) == 1, (y_values, batch_size)
assert all(isinstance(y, hl.Expression) for y in y_values), y_values
gwas = hl.linear_regression_rows( 
    y_values, 
    geno_subset_mt.GT.n_alt_alleles(), 
    covariates=[1.0], 
    block_size=128) 

This value controls the length of the p_value array. You could try editing your installation of Hail to print the length of that array before it calls ir.MatrixToTableApply.

Are all the other fields (t_stat, standard_error, beta, and y_transpose_x) also length 32? If you open the matrix table and table folders that you’re using there’s a file metadata.json.gz. Can you uncompress that file and post file_version and hail_version here? We check for backwards compatibility, but at this point I’m at a loss for what could be the issue.