Hi @heidiesteiner !
Welcome to the Hail discussion forum! My apologies that the Hail docs are difficult to navigate. You are right that the GWAS tutorial does not have an example of exporting.
There are two ways to do a linear GWAS:
-
hl.linear_regression_rows
an efficient method which does not provide betas or p-values for covariates
-
hl.linreg
a less efficient method which provides betas and p-values for every independent variable
hl.linear_regression_rows
The first way produces a new table with the betas and p-values:
mt = hl.read_matrix_table(...)
results = hl.linear_regression_rows(x=mt.GT.n_alt_alleles(), y=mt.pheno, covariates=[1])
You can export this results table like this:
results.export('path/to/results.tsv')
For example:
import hail as hl
mt = hl.balding_nichols_model(3, 10, 100)
mt = mt.annotate_cols(pheno = hl.rand_unif(0, 1))
results = hl.linear_regression_rows(y=mt.pheno, x=mt.GT.n_alt_alleles(), covariates=[1])
results.write('results.ht')
results = hl.read_table('results.ht')
results.export('results.tsv')
This will look like:
locus alleles n sum_x y_transpose_x beta standard_error t_stat p_value
1:1 ["A","C"] 10 1.1000e+01 4.1181e+00 -1.1360e-01 1.3908e-01 -8.1682e-01 4.3768e-01
hl.linreg
This method is somewhat more complicated because there are multiple betas and p-values (one for each independent variable). The key insight is that you must use the rows Table of the MatrixTable. A MatrixTable is inherently two-dimensional (e.g. rows and columns, variants and samples) which is difficult to represent as a TSV. The rows Table is just the row metadata (aka variant annotations) of a MatrixTable. The rows Table has no genotypes and no sample information. Here’s a complete working example:
import hail as hl
mt = hl.balding_nichols_model(3, 10, 100)
mt = mt.annotate_cols(pheno = hl.rand_unif(0, 1))
results = mt.annotate_rows(
gwas = hl.agg.linreg(y=mt.pheno, x=[1, mt.GT.n_alt_alleles()])
)
results = results.rows()
results = results.select(**results.gwas)
results.write('results.ht')
results = hl.read_table('results.ht')
results.export('results.tsv')
This will look like:
locus alleles beta standard_error t_stat p_value multiple_standard_error multiple_r_squared adjusted_r_squared f_stat multiple_p_value n
1:1 ["A","C"] [0.474068096467072,0.030677768466278763] [0.13671934542438463,0.1303567810882963] [3.4674544044629356,0.23533695915289118] [0.008474389322975385,0.8198595521345753] 3.2196e-01 6.8753e-03 -1.1727e-01 5.5383e-02 8.1986e-01 10
Why write to a Hail table first?
Hail tables are an efficient, binary, partitioned format for data storage. We always recommend writing to them first. It is often faster to write to a Hail Table first, read the just-written table, and finally export to a TSV.
What does ** do?
The line
results = results.select(**results.gwas)
Uses Python’s keyword-unpack syntax to “flatten” the GWAS results structure into “top-level” fields. Without this line, your TSV would contain three columns: locus, alleles, and “gwas”. The GWAS column would contain JSON dictionaries. That would look like:
locus alleles gwas
1:1 ["A","C"] {"beta":[0.49904398132709865,-0.032981218086137985],"standard_error":[0.13763596659815874,0.094977840276184],"t_stat":[3.6258253831580585,-0.3472517167186853],"p_value":[0.006725578635075359,0.7373607149133328],"multiple_standard_error":0.19231546892662096,"multiple_r_squared":0.014849148584095029,"adjusted_r_squared":-0.10829470784289308,"f_stat":0.12058375476408058,"multiple_p_value":0.7373607149133261,"n":10}
I want my variants to look like 1:1:A:C
Yeah, this is very reasonable and a little funky to do with Hail. As always, I recommend writing the results to a Table first:
results.write('results.ht')
Then this magic incantation will convert the locus and alleles into the expected format and place them first in the TSV columns. It should work with either hl.linreg
or hl.linear_regression_rows
.
results = hl.read_table('results.ht')
results = results.key_by()
results = results.select(
variant=hl.variant_str(results.locus, results.alleles),
**{field: results[field]
for field in results.row if field not in ('locus', 'alleles')}
)
results = results.key_by('variant')
results.export('results.tsv')
This would look like:
variant beta standard_error t_stat p_value multiple_standard_error multiple_r_squared adjusted_r_squared f_stat multiple_p_value n
1:100:A:C [0.5895001678068894,-0.005603007585164126] [0.19935844543752126,0.12608535167459406] [2.956986178905765,-0.044438211979013896] [0.01822997569989276,0.9656442941648257] 3.5885e-01 2.4678e-04 -1.2472e-01 1.9748e-03 9.6564e-01 10