Hello - I am following using the methods suggested to me in this thread.
Briefly, I have 79800 gwases to run on ~11.5k samples. For efficiency, @tpoterba suggested that I run the gwas in batches. Just to make sure I know what’s happening and understand the output, I’m testing the batch suggestion, but only for a batch size of 1. The gwas results are confusing, though. For each row (eg, each rsid), metrics like beta, t_stat, and p_value are arrays instead of a single float value. It makes sense that they would be an array of one, since the subset method enters an array as the x side of the equation. So that when I expand the batches to larger sizes, the size arrays in the results would increase as well.
However, for the gwas using an array of 1, instead of having an array containing 1 p_value per rsid, the results are much larger:
gwas.p_value.show()
locus alleles p_value
locus<GRCh37> array<str> array<float64>
1:57222 ["T","C"] [9.16e-01,3.23e-01,9.61e-01,6.96e-02,7.45e-01,2.25e-01,8.18e-01,6.95e-01,8.49e-01,7.77e-01,5.69e-01,6.39e-01,5.56e-01,3.00e-01,3.79e-01,3.67e-01,7.50e-01,9.23e-01,6.73e-01,3.68e-01,5.24e-01,9.56e-01,7.51e-01,9.52e-01,7.38e-01,7.92e-01,4.32e-01,3.54e-01,1.86e-01,8.24e-01,4.98e-01,9.96e-01]
1:60433 ["A","G"] [3.95e-01,8.55e-01,4.19e-01,8.79e-01,4.42e-01,5.34e-01,9.38e-01,9.58e-01,8.13e-01,2.02e-01,8.70e-02,1.28e-01,6.13e-01,4.38e-01,3.25e-01,9.40e-01,7.38e-01,1.69e-01,5.13e-01,6.35e-01,9.72e-01,5.15e-01,8.16e-01,9.57e-01,6.89e-01,7.80e-01,7.98e-01,4.36e-01,8.26e-01,3.92e-01,1.46e-01,4.10e-01]
1:63336 ["C","T"] [6.86e-03,5.18e-01,7.90e-01,2.06e-01,9.57e-01,1.83e-02,1.10e-01,3.83e-01,5.97e-01,8.65e-01,7.18e-01,4.94e-01,9.17e-01,7.28e-01,6.74e-01,4.81e-01,8.48e-01,3.96e-01,1.95e-01,5.45e-01,5.45e-01,5.85e-01,8.27e-01,5.65e-01,8.05e-01,3.54e-01,7.87e-01,3.99e-01,7.90e-01,8.11e-01,7.77e-01,7.53e-01]
1:64795 ["A","G"] [4.99e-01,3.70e-01,6.01e-01,6.84e-01,2.80e-01,9.85e-01,1.63e-01,4.68e-01,2.57e-01,1.99e-01,1.68e-02,1.49e-01,1.04e-01,2.90e-01,7.12e-02,9.15e-01,3.78e-01,8.55e-01,6.45e-01,4.06e-01,6.85e-01,3.73e-01,4.19e-01,4.27e-01,4.85e-01,9.05e-01,1.81e-01,6.36e-01,5.42e-03,1.17e-02,2.44e-02,6.38e-03]
1:76897 ["G","A"] [6.84e-01,9.51e-01,7.86e-01,9.18e-01,4.77e-01,9.33e-01,8.41e-01,9.86e-01,9.81e-01,4.30e-01,9.00e-01,8.21e-01,6.45e-01,5.33e-01,8.64e-01,6.66e-01,9.45e-01,9.83e-01,8.95e-01,6.13e-01,8.00e-01,8.60e-01,6.95e-01,5.17e-01,7.58e-01,7.07e-01,9.62e-01,9.11e-01,7.05e-01,9.80e-01,7.61e-01,7.71e-01]
1:77874 ["G","A"] [7.49e-01,3.76e-01,4.87e-01,8.39e-01,6.69e-01,7.06e-01,9.07e-01,5.75e-01,5.09e-01,6.23e-01,7.95e-01,4.60e-01,3.82e-01,4.70e-01,5.28e-01,3.52e-01,1.04e-01,7.42e-01,2.46e-01,8.70e-01,5.84e-01,5.00e-01,6.42e-01,5.77e-01,5.74e-01,9.84e-01,6.26e-01,4.79e-01,9.22e-01,8.09e-01,7.74e-01,7.90e-01]
1:82036 ["A","T"] [9.96e-01,5.29e-01,8.02e-02,4.27e-01,5.81e-01,3.29e-01,4.27e-01,3.76e-01,7.48e-01,7.44e-01,4.84e-01,9.67e-01,8.68e-01,2.44e-01,1.57e-01,6.57e-01,8.45e-01,3.74e-01,9.18e-01,1.20e-01,1.26e-01,6.82e-01,9.82e-01,6.96e-01,3.02e-01,6.63e-01,1.89e-01,1.50e-01,6.35e-01,8.79e-01,4.25e-01,3.15e-01]
1:84156 ["A","C"] [5.12e-01,4.30e-01,7.06e-01,6.67e-01,7.12e-01,6.01e-01,8.78e-01,9.07e-01,3.83e-01,3.43e-01,2.12e-01,5.10e-01,5.31e-01,8.36e-01,7.93e-01,3.02e-01,5.48e-01,7.42e-01,7.59e-01,7.30e-01,2.50e-01,9.14e-01,5.08e-01,4.62e-01,5.82e-01,1.37e-01,4.75e-01,6.54e-01,1.12e-01,4.17e-01,9.72e-01,5.91e-01]
1:85107 ["A","G"] [2.13e-01,1.08e-01,2.36e-01,6.61e-01,5.95e-01,8.04e-01,7.84e-01,3.97e-01,2.83e-01,3.15e-02,1.22e-01,9.26e-01,9.63e-01,3.83e-01,6.57e-01,6.66e-02,4.54e-01,5.11e-01,3.96e-01,4.96e-01,2.06e-01,5.76e-01,7.56e-01,7.55e-01,5.18e-01,4.35e-02,3.34e-01,8.56e-01,9.66e-02,7.02e-01,7.92e-01,3.56e-01]
1:85941 ["C","T"] [3.65e-01,4.20e-01,9.31e-01,5.15e-01,8.29e-01,5.80e-01,5.15e-01,1.71e-01,6.57e-02,1.30e-01,4.23e-01,7.52e-01,9.14e-01,9.60e-01,4.93e-01,1.27e-01,6.44e-02,5.14e-02,5.75e-02,5.78e-01,2.23e-02,6.29e-02,1.38e-01,4.09e-01,2.97e-01,8.42e-01,4.43e-01,5.48e-01,1.97e-02,2.58e-01,2.36e-01,8.16e-01]
Here is the code I used to generate these results:
batch_size = 1
batch_index = 0
# read in the pheno matrix table
phenos_mt = hl.read_matrix_table('idps_i_deconf.mt')
phenos_mt = phenos_mt.add_col_index()
# find the start and stop columns of the idps
pheno_start = batch_size * batch_index
pheno_end = pheno_start + batch_size
# filter the idps of interest
phenos_mt = phenos_mt.filter_cols((phenos_mt.col_idx >= pheno_start) & (phenos_mt.col_idx < pheno_end))
phenos_table = phenos_mt.annotate_rows(phenos = hl.agg.collect(phenos_mt.x)).rows()
# merge the idps with the genetic data
geno_subset_mt = geno_subset_mt.annotate_cols(phenos = phenos_table[geno_subset_mt.s].phenos)
# run the gwas
gwas = hl.linear_regression_rows(
[geno_subset_mt.phenos[i] for i in range(batch_size)],
geno_subset_mt.dosage,
covariates=[1.0],
block_size=128)
Thanks in advance!