Export GWAS summary statistics to a .txt file

Hi there,

I used Hail’s “hl.agg.linreg” function to run GWAS and would like to export the summary statistics into a .txt file. Specifically, here’s my code for GWAS:

mt = mt.annotate_rows(gwas = hl.agg.linreg(mt.quant_pheno, [1, mt.GT.n_alt_alleles()]))

And here’s my code to export the output:

mt.gwas.flatten().export(‘EA_hail.txt’)

I only have ~6000 SNPs in the summary statistics, but it took about 5 minutes to generate the output. Also, the “flatten” function didn’t work properly, as the statistics were clumped together instead of showing up in separate columns when I read the .txt file into R. May I ask if there’s any mistake in my code? Also, how could I improve the speed of exporting output? Thank you very much!

Best,
Ethan

Thank you for your question, EZXYZ! Our apologies for getting back to you after awhile.

A few follow up questions:

  1. I am assuming that you are exporting everything in gwas regardless of magnitude of point estimates or significance threshold?
  2. If you were to export the mt, I would suggest first saving the gwas annotated rows of your summary statistics into a Hail table and then export without flattening the data into the table. Happy to help further if I possibly got your question wrong.

Thanks!

Great, this really helps, thanks!

Could you please be more explicit? I am new to hail. Suppose your gwas is in “mt.gwas”, and you want to save the summary stats to “gwas.summary.txt”, what would be the code snippet?

Thanks a lot!

I would highly suggest going through the tutorial for GWAS Hail | GWAS Tutorial
After you are done with running your GWAS, you should be able to export it as a csv table

I was going to say it would be very helpful to include that part of the code in the GWAS tutorial. I did go through the GWAS tutorial multiple times, but was unable to find the code to save the results to a csv file. I was able to use mt.gwas.flatten().export(‘<myfile.tsv’) to save the results, though.

Thanks.

Hello @kumarveerapen ,

I haven’t been able to export summary statistics based on this solution.

I am using the same function to run a gwas…similar to mt = mt.annotate_rows(gwas = hl.agg.linreg(mt.pheno, [1, mt.GT]) is it possible to provide a code snippet for how to create a summary statistics file containing the GWAS struct as columns and loci as rows as seen with PLINK files?

I don’t see anywhere in the GWAS Tutorial that exports literally anything so your response below to @mocksu is super strange…are there other tutorials for newbies to multidimensional matricies that you may suggest that would help with this confusion?

Thanks for your help!

Hi @heidiesteiner !

Welcome to the Hail discussion forum! My apologies that the Hail docs are difficult to navigate. You are right that the GWAS tutorial does not have an example of exporting.

There are two ways to do a linear GWAS:

  1. hl.linear_regression_rows an efficient method which does not provide betas or p-values for covariates
  2. hl.linreg a less efficient method which provides betas and p-values for every independent variable

hl.linear_regression_rows

The first way produces a new table with the betas and p-values:

mt = hl.read_matrix_table(...)
results = hl.linear_regression_rows(x=mt.GT.n_alt_alleles(), y=mt.pheno, covariates=[1])

You can export this results table like this:

results.export('path/to/results.tsv')

For example:

import hail as hl

mt = hl.balding_nichols_model(3, 10, 100) 
mt = mt.annotate_cols(pheno = hl.rand_unif(0, 1)) 
results = hl.linear_regression_rows(y=mt.pheno, x=mt.GT.n_alt_alleles(), covariates=[1]) 
results.write('results.ht')
results = hl.read_table('results.ht')
results.export('results.tsv')                                                  

This will look like:

locus	alleles	n	sum_x	y_transpose_x	beta	standard_error	t_stat	p_value
1:1	["A","C"]	10	1.1000e+01	4.1181e+00	-1.1360e-01	1.3908e-01	-8.1682e-01	4.3768e-01

hl.linreg

This method is somewhat more complicated because there are multiple betas and p-values (one for each independent variable). The key insight is that you must use the rows Table of the MatrixTable. A MatrixTable is inherently two-dimensional (e.g. rows and columns, variants and samples) which is difficult to represent as a TSV. The rows Table is just the row metadata (aka variant annotations) of a MatrixTable. The rows Table has no genotypes and no sample information. Here’s a complete working example:

import hail as hl

mt = hl.balding_nichols_model(3, 10, 100) 
mt = mt.annotate_cols(pheno = hl.rand_unif(0, 1)) 
results = mt.annotate_rows( 
    gwas = hl.agg.linreg(y=mt.pheno, x=[1, mt.GT.n_alt_alleles()])
) 
results = results.rows() 
results = results.select(**results.gwas)
results.write('results.ht')
results = hl.read_table('results.ht')
results.export('results.tsv')

This will look like:

locus	alleles	beta	standard_error	t_stat	p_value	multiple_standard_error	multiple_r_squared	adjusted_r_squared	f_stat	multiple_p_value	n
1:1	["A","C"]	[0.474068096467072,0.030677768466278763]	[0.13671934542438463,0.1303567810882963]	[3.4674544044629356,0.23533695915289118]	[0.008474389322975385,0.8198595521345753]	3.2196e-01	6.8753e-03	-1.1727e-01	5.5383e-02	8.1986e-01	10

Why write to a Hail table first?

Hail tables are an efficient, binary, partitioned format for data storage. We always recommend writing to them first. It is often faster to write to a Hail Table first, read the just-written table, and finally export to a TSV.

What does ** do?

The line

results = results.select(**results.gwas)

Uses Python’s keyword-unpack syntax to “flatten” the GWAS results structure into “top-level” fields. Without this line, your TSV would contain three columns: locus, alleles, and “gwas”. The GWAS column would contain JSON dictionaries. That would look like:

locus	alleles	gwas
1:1	["A","C"]	{"beta":[0.49904398132709865,-0.032981218086137985],"standard_error":[0.13763596659815874,0.094977840276184],"t_stat":[3.6258253831580585,-0.3472517167186853],"p_value":[0.006725578635075359,0.7373607149133328],"multiple_standard_error":0.19231546892662096,"multiple_r_squared":0.014849148584095029,"adjusted_r_squared":-0.10829470784289308,"f_stat":0.12058375476408058,"multiple_p_value":0.7373607149133261,"n":10}

I want my variants to look like 1:1:A:C

Yeah, this is very reasonable and a little funky to do with Hail. As always, I recommend writing the results to a Table first:

results.write('results.ht')

Then this magic incantation will convert the locus and alleles into the expected format and place them first in the TSV columns. It should work with either hl.linreg or hl.linear_regression_rows.

results = hl.read_table('results.ht')
results = results.key_by() 
results = results.select( 
    variant=hl.variant_str(results.locus, results.alleles), 
    **{field: results[field] 
       for field in results.row if field not in ('locus', 'alleles')} 
) 
results = results.key_by('variant') 
results.export('results.tsv')

This would look like:

variant	beta	standard_error	t_stat	p_value	multiple_standard_error	multiple_r_squared	adjusted_r_squared	f_stat	multiple_p_value	n
1:100:A:C	[0.5895001678068894,-0.005603007585164126]	[0.19935844543752126,0.12608535167459406]	[2.956986178905765,-0.044438211979013896]	[0.01822997569989276,0.9656442941648257]	3.5885e-01	2.4678e-04	-1.2472e-01	1.9748e-03	9.6564e-01	10

Hello @danking !

Holy cow - this is so helpful! I really hope you had some of this prepared already!

The bit on what ** does is exactly what I needed but didn’t know how to search for this (and was told not to flatten in the solved post so didn’t bother searching for that - oops!)

Still a bit unclear what’s happening here:

but, now I can learn from this snippet! Thank you!

I’m bound to hl.linreg for this specific project but it’s good to know the difference in methods. Thank you for this explanation.

Thank you so much @danking !!!

1 Like