Requesting advice on efficiently parsing through many GWAS results

Hello - I appreciate your previous advice on my gwas project. After a couple of months of compute, I finally have the results. However, I now need to parse and reduce the results' dimensionality in a specific way. My current method would likely take another couple of months of compute, so I'm asking for advice.

I have 11.5k samples and 79,800 independent gwas results. It was suggested to me to batch my gwases, which I did: I grouped the gwas calculations into groups of 20, so now I have 3,990 Hail tables of results. These results were saved as tsv.bgz files using the .export() method.

Here is the format of an example results table:

locus	alleles	rsid	beta
str	str	str	str
"1:10177"	"["A","AC"]"	"rs367896724"	"[-0.004992247354191095,0.0026417819469886474,-0.014409962452715221,0.004777507278440492,-0.0056093847643026,0.0023188415398793264,-0.015489004864868155,-0.014825598498744404,-0.014305084189602574,-0.015404430056206888,-0.013617417612095348,-0.02693687459808479,-0.025542563554703378,-0.005741555262049462,0.00865281622141147,0.001511397760850175,0.005827099846636984,0.031529446318792724,0.007504775012500903,-0.012863855363026016]"
"1:10235"	"["T","TA"]"	"rs540431307"	"[0.989170870829994,0.45589529664586437,0.07456373287634475,0.05167066711118691,-0.3496979091278656,0.4011920905498073,-0.3291856676880175,-0.01002863257549891,0.25749922562854016,-0.2566591179163784,0.44599895972103737,-0.310441194598682,0.8020505230033537,-0.2752300904721687,0.8214149187108155,-0.03920455900499864,0.3603526914002074,0.9642902631203855,0.5036438278449906,0.22365514191935543]"
"1:10352"	"["T","TA"]"	"rs201106462"	"[-0.0020278618338467015,-0.004461260940655055,-0.026189263875130746,0.005355074856952288,0.011523131347578843,-0.007970166278363253,0.003296571384735143,-0.00649396809716581,0.014300983043055285,4.40261684968913E-4,0.003538993449066244,-8.286792318639081E-4,-0.020481793463538518,0.008636489557145695,0.00710222047012649,0.014516439256649201,0.014693430140656416,0.005001212977414611,0.016549539697944978,0.006061780447752583]"
"1:10511"	"["G","A"]"	"rs534229142"	"[-0.22979288348177693,-0.63180137240308,-0.2642197661671302,-0.3405353460261729,-0.2246034835820954,-0.13901193871683573,-0.3569940277656782,-0.0909523045346519,0.13585679735890965,0.024173136608572173,-0.3799387952000725,0.15860955045867584,-0.19425315063369158,0.014375059046760038,-0.21189394358371727,-0.28794279936083406,-0.24285737762283013,-0.22134029471797062,-0.18501834293857816,0.021753184745015277]"
"1:10539"	"["C","A"]"	"rs537182016"	"[-0.24804359262513834,-0.7466315459222219,-0.20205947593830098,-0.6517927032349672,-0.37240926870596797,-0.20646940743643485,-0.46798830905964944,-0.40397322572818833,0.29269677247906517,-0.8018879297190337,-1.2039720781574579,-0.8974650971076943,0.24981530667688068,-0.5484618372137677,0.9445202463042509,-0.22211340995039952,-0.053686820149640525,0.23281955836171025,-0.43378366289313103,-0.9586143160806833]"
"1:10542"	"["C","T"]"	"rs572818783"	"[30.939719605803663,48.783415439832766,6.739350041755941,61.189494422123985,9.726067145889424,21.799292474564655,22.261921133117312,16.232189172074975,-9.744005641630016,48.26832943863782,11.98306878858923,-6.821133363834209,-11.593961173560622,47.550041474034096,13.846639235703389,36.026027215365744,-9.418893236891918,19.65837983385381,33.95421687385974,83.3137124899285]"
"1:10616"	"["CCGCCGTTGCAAAGGCGCGCCG","C"]"	"1:10616_CCGCCGTTGCAAAGGCGCGCCG_C"	"[0.1572195636635454,-0.03380466544471702,0.10008297890229777,0.16838572990469847,-0.06865734474318705,0.09650852382583495,-0.050156361176896494,0.047417674622061416,-0.10872370770773801,0.06152917138145618,-0.17989300833614524,-0.009355772096424981,-0.050177672665647974,0.051020923197199024,-0.014341393681502381,-0.12347218381418407,-0.1675451785660034,-0.17914536174644388,-0.2511418897114047,0.08425392120678984]"
"1:10642"	"["G","A"]"	"rs558604819"	"[0.05569764017945619,0.37420858922910666,-0.7208325550938323,-0.1916126970575322,-0.8340579896021219,-0.6570988531363929,0.5198944261836436,-0.07383483925776889,0.3513901504802878,0.2677412534025106,-0.5468265382140379,-0.48199159768215777,-1.1770404732166548,-1.1885635716955438,-0.3188556174242601,0.1817088667146935,-0.6415826723687796,0.21262382922173031,-0.004160434941234663,-0.023916588391776874]"
"1:11008"	"["C","G"]"	"rs575272151"	"[0.01242166814237488,0.03263509520576859,0.0029011642783888466,0.01925529976620068,0.03227821661755542,0.021883418635852116,0.06437957339363902,6.992602915144286E-4,0.027012784894192243,-0.020915543585543827,0.03932005240599367,0.013229880666337143,0.031846916340306015,0.016521620844440307,0.0050513889209964245,0.017229061932873962,-0.009992510637037553,0.008859708896325202,-0.015770399904468206,0.0240271472515886]"
"1:11012"	"["C","G"]"	"rs544419019"	"[0.01242166814237488,0.03263509520576859,0.0029011642783888466,0.01925529976620068,0.03227821661755542,0.021883418635852116,0.06437957339363902,6.992602915144286E-4,0.027012784894192243,-0.020915543585543827,0.03932005240599367,0.013229880666337143,0.031846916340306015,0.016521620844440307,0.0050513889209964245,0.017229061932873962,-0.009992510637037553,0.008859708896325202,-0.015770399904468206,0.0240271472515886]"

The issue is that the 79,800 gwases are not completely independent. We have 400 brain regions and a 400 x 400 connectivity matrix, so 79,800 unique brain connections. I now need to average the betas to get a mean beta for each region for each rsid. The problem is that the gwas results are distributed among 3,990 tables.

My first attempt was: for each rsid, loop through the tables, filter, and append the results to a list. So, for each rsid, I'll have a list of 79,800 beta values that I can then average for each brain region.

import hail as hl
import pickle

hl.init()

rsid = 'rs367896724'

betas = []
for i in range(3990):
    print(f'gwas_betas_idx{i}_step20.tsv.bgz')
    file = f'/data/gwas/hail-tsvs/tsvs/gwas_betas_idx{i}_step20.tsv.bgz'
    data = hl.import_table(file)
    rs_data = data.filter(data.rsid == rsid)
    rs_struct = rs_data.take(1)
    # beta is a string that looks like a list; strip the brackets and parse floats
    lst = [float(x) for x in rs_struct[0].beta[1:-1].split(',')]
    betas.extend(lst)

with open(f'/data/gwas/hail-tsvs/rsid-betas/{rsid}.pkl', 'wb') as f:
    pickle.dump(betas, f)

This works, but it takes about a minute to run for each table, so for one rsid I'm looking at ~2.75 days of processing. My next attempt will be to distribute this in some way, probably distributing at the rsid level since I have so many of them, but I wanted to get your advice first.

Thank you in advance!

It’s possible to import all the files as a single Hail Table with hl.import_table. You can use the source_file_field='filename' argument to get the input file as a field of the table as well, if you need. You could do something like:

ht = hl.import_table([all_the_tables], ...)
ht2 = ht.group_by(ht.rsid).aggregate(mean_beta = hl.agg.explode(lambda x: hl.agg.mean(x), ht.beta[1:-1].split(',').map(hl.float)))


Thanks, @tpoterba, I’ve tried this a few times. The general idea of grouping and aggregating all of the tables into one hail table makes sense. However, I’m having some difficulty figuring out how to implement what I want.

See above for an example of one of the tables. Note that the column beta is a str that looks like a list:

"[-0.004992247354191095,0.0026417819469886474,-0.014409962452715221,0.004777507278440492,-0.0056093847643026,0.0023188415398793264,-0.015489004864868155,-0.014825598498744404,-0.014305084189602574,-0.015404430056206888,-0.013617417612095348,-0.02693687459808479,-0.025542563554703378,-0.005741555262049462,0.00865281622141147,0.001511397760850175,0.005827099846636984,0.031529446318792724,0.007504775012500903,-0.012863855363026016]"

I need to do two things with this:

  1. Convert the str to a list of floats
  2. Group the tables by rsid, then concatenate the lists. The order of the floats within each list and the order of the lists are important.

What I’m hoping for is that for each rsid I’ll have a list of all betas (79,800) in order, so that I can then run something like the following code:

i, j = np.triu_indices(400, k=1)
M = np.zeros((400, 400))
M[i, j] = list_data
M[j, i] = list_data
np.mean(M, axis=0)

Essentially I’m taking the 79,800 floats and projecting them into a symmetric 400x400 matrix so that I can take the mean of each column (or row, since it’s symmetric).

Do you have an idea of how I can implement any of the steps above in hail?

Hello - I’m following up here.

This is what my tables look like:

locus	alleles	rsid	beta
str	str	str	str
"1:10177"	"["A","AC"]"	"rs367896724"	"[-0.004992247354191095,0.0026417819469886474,-0.014409962452715221,0.004777507278440492,-0.0056093847643026,0.0023188415398793264,-0.015489004864868155,-0.014825598498744404,-0.014305084189602574,-0.015404430056206888,-0.013617417612095348,-0.02693687459808479,-0.025542563554703378,-0.005741555262049462,0.00865281622141147,0.001511397760850175,0.005827099846636984,0.031529446318792724,0.007504775012500903,-0.012863855363026016]"
"1:10235"	"["T","TA"]"	"rs540431307"	"[0.989170870829994,0.45589529664586437,0.07456373287634475,0.05167066711118691,-0.3496979091278656,0.4011920905498073,-0.3291856676880175,-0.01002863257549891,0.25749922562854016,-0.2566591179163784,0.44599895972103737,-0.310441194598682,0.8020505230033537,-0.2752300904721687,0.8214149187108155,-0.03920455900499864,0.3603526914002074,0.9642902631203855,0.5036438278449906,0.22365514191935543]"
"1:10352"	"["T","TA"]"	"rs201106462"	"[-0.0020278618338467015,-0.004461260940655055,-0.026189263875130746,0.005355074856952288,0.011523131347578843,-0.007970166278363253,0.003296571384735143,-0.00649396809716581,0.014300983043055285,4.40261684968913E-4,0.003538993449066244,-8.286792318639081E-4,-0.020481793463538518,0.008636489557145695,0.00710222047012649,0.014516439256649201,0.014693430140656416,0.005001212977414611,0.016549539697944978,0.006061780447752583]"
"1:10511"	"["G","A"]"	"rs534229142"	"[-0.22979288348177693,-0.63180137240308,-0.2642197661671302,-0.3405353460261729,-0.2246034835820954,-0.13901193871683573,-0.3569940277656782,-0.0909523045346519,0.13585679735890965,0.024173136608572173,-0.3799387952000725,0.15860955045867584,-0.19425315063369158,0.014375059046760038,-0.21189394358371727,-0.28794279936083406,-0.24285737762283013,-0.22134029471797062,-0.18501834293857816,0.021753184745015277]"
"1:10539"	"["C","A"]"	"rs537182016"	"[-0.24804359262513834,-0.7466315459222219,-0.20205947593830098,-0.6517927032349672,-0.37240926870596797,-0.20646940743643485,-0.46798830905964944,-0.40397322572818833,0.29269677247906517,-0.8018879297190337,-1.2039720781574579,-0.8974650971076943,0.24981530667688068,-0.5484618372137677,0.9445202463042509,-0.22211340995039952,-0.053686820149640525,0.23281955836171025,-0.43378366289313103,-0.9586143160806833]"
"1:10542"	"["C","T"]"	"rs572818783"	"[30.939719605803663,48.783415439832766,6.739350041755941,61.189494422123985,9.726067145889424,21.799292474564655,22.261921133117312,16.232189172074975,-9.744005641630016,48.26832943863782,11.98306878858923,-6.821133363834209,-11.593961173560622,47.550041474034096,13.846639235703389,36.026027215365744,-9.418893236891918,19.65837983385381,33.95421687385974,83.3137124899285]"
"1:10616"	"["CCGCCGTTGCAAAGGCGCGCCG","C"]"	"1:10616_CCGCCGTTGCAAAGGCGCGCCG_C"	"[0.1572195636635454,-0.03380466544471702,0.10008297890229777,0.16838572990469847,-0.06865734474318705,0.09650852382583495,-0.050156361176896494,0.047417674622061416,-0.10872370770773801,0.06152917138145618,-0.17989300833614524,-0.009355772096424981,-0.050177672665647974,0.051020923197199024,-0.014341393681502381,-0.12347218381418407,-0.1675451785660034,-0.17914536174644388,-0.2511418897114047,0.08425392120678984]"
"1:10642"	"["G","A"]"	"rs558604819"	"[0.05569764017945619,0.37420858922910666,-0.7208325550938323,-0.1916126970575322,-0.8340579896021219,-0.6570988531363929,0.5198944261836436,-0.07383483925776889,0.3513901504802878,0.2677412534025106,-0.5468265382140379,-0.48199159768215777,-1.1770404732166548,-1.1885635716955438,-0.3188556174242601,0.1817088667146935,-0.6415826723687796,0.21262382922173031,-0.004160434941234663,-0.023916588391776874]"
"1:11008"	"["C","G"]"	"rs575272151"	"[0.01242166814237488,0.03263509520576859,0.0029011642783888466,0.01925529976620068,0.03227821661755542,0.021883418635852116,0.06437957339363902,6.992602915144286E-4,0.027012784894192243,-0.020915543585543827,0.03932005240599367,0.013229880666337143,0.031846916340306015,0.016521620844440307,0.0050513889209964245,0.017229061932873962,-0.009992510637037553,0.008859708896325202,-0.015770399904468206,0.0240271472515886]"
"1:11012"	"["C","G"]"	"rs544419019"	"[0.01242166814237488,0.03263509520576859,0.0029011642783888466,0.01925529976620068,0.03227821661755542,0.021883418635852116,0.06437957339363902,6.992602915144286E-4,0.027012784894192243,-0.020915543585543827,0.03932005240599367,0.013229880666337143,0.031846916340306015,0.016521620844440307,0.0050513889209964245,0.017229061932873962,-0.009992510637037553,0.008859708896325202,-0.015770399904468206,0.0240271472515886]"

I need to:

  1. convert beta to a list of floats.
    I’ve gotten as far as converting to a list of strings: data.beta[1:-1].split(','), but how do I make this field a list of floats instead of strings? I feel like there must be a more efficient way than using a list comprehension on every single row, and even then I’m not sure that’s possible in Hail.

  2. I need to group many tables by rsid and extend beta accordingly.
    That is, I have table1, table2, table3, etc., all of which contain the same rsids but a different set of 20 betas. Ultimately I need one long list per rsid.

  3. Once I have the long list of numbers, I have a function I need to apply to them. Essentially,

i, j = np.triu_indices(400, k=1)
M = np.zeros((400, 400))
M[i, j] = betas
M[j, i] = betas
np.mean(M, axis=0)

Here, I’m projecting a list of 79,800 values into a 400 x 400 matrix and taking the mean of each row. Ultimately I need a list of 400 floats per rsid.
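As a sanity check for step 3, here's a toy version with made-up numbers, using n = 4 regions (6 upper-triangle values) in place of 400 regions (79,800 values); the code path is identical to the real case:

```python
import numpy as np

# Toy version of step 3: n = 4 regions -> 6 upper-triangle values
# (the real case is n = 400 -> 79,800 values; same code path).
n = 4
betas = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

i, j = np.triu_indices(n, k=1)
M = np.zeros((n, n))
M[i, j] = betas
M[j, i] = betas

# Column means; equal to row means because M is symmetric.
region_means = M.mean(axis=0)
```

One thing to note: mean(axis=0) divides by n and counts the zero diagonal as a value; if you want the average over only the n - 1 real connections per region, use M.sum(axis=0) / (n - 1) instead.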

Help on any of the above steps would be great. I’ve been stuck for a couple of weeks.

Eep, did Hail generate this file? We should fix our exporters if that’s the case. The "["A","AC"]" looks like a real PITA to parse.

  1. I think you want this:
import hail as hl 
ht = hl.import_table('bar.tsv', find_replace=("\"(\t)\"", "\t"), filter='str\tstr\tstr\tstr') 
ht = ht.select( 
    locus = hl.parse_locus(ht.locus[1:]), 
    rsid = ht.rsid, 
    alleles = hl.parse_json(ht.alleles, hl.tarray(hl.tstr)), 
    beta = hl.parse_json(ht.beta[:-1], hl.tarray(hl.tfloat)) 
)
ht = ht.key_by(ht.locus)
  2. The quick and dirty solution is to join these tables pairwise. You’ll probably want to do a binary tree of joins. We have faster ways to do this, but they’re more complicated; try this first and see if it’s fast enough. Here’s an example for joining two tables. You’ll want to do this recursively: join the 3,990 files down to 1,995, then those down to 998, and so on, until only a handful remain that you can join explicitly.
ht = ht.annotate(beta = ht.beta + ht2[ht.locus].beta)
  3. I’m not sure I follow this upper-triangular thing. If you want the mean of each row, then do this:
ht = ht.annotate(mean = hl.mean(ht.beta))
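The binary tree of joins can be sketched like this. This is a hypothetical helper (tree_join and combine are my names, not Hail API); combine stands in for one two-table Hail join, e.g. lambda l, r: l.annotate(beta=l.beta.extend(r[l.locus].beta)) (ArrayExpression.extend concatenates arrays):

```python
# Hypothetical sketch of a binary tree of joins; `combine` is one
# two-table join. Each round halves the number of tables, so the
# whole merge takes ~log2(n) rounds.
def tree_join(tables, combine):
    while len(tables) > 1:
        nxt = [combine(tables[k], tables[k + 1])
               for k in range(0, len(tables) - 1, 2)]
        if len(tables) % 2 == 1:
            nxt.append(tables[-1])  # odd table carries over to the next round
        tables = nxt
    return tables[0]

# Toy demo with plain lists standing in for per-table beta arrays:
merged = tree_join([[1], [2], [3], [4], [5]], lambda a, b: a + b)
```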

EDIT: Sorry, I should have clarified. The table I pasted was the Python output of ht.show(), not the raw contents of the file itself.

Yes, Hail created these files, but it’s likely that the issue had to do with my specific use (and maybe error). I was running batches of 20 gwases at a time.

batch_results = hl.linear_regression_rows(
    [geno_subset_mt.phenos[i] for i in range(batch_size)],
    geno_subset_mt.dosage,
    covariates=[1.0],
    pass_through=[geno_subset_mt.rsid, geno_subset_mt.mfi.chrom],
    block_size=128)

# save the results
res = batch_results.select('rsid', 'beta')
res = res.filter(hl.is_nan(res.beta[0]), keep=False)
res.export(f'/gwas/tsvs/gwas_betas_idx{settings["idx"]}_step{settings["step"]}.tsv.bgz')

In hindsight, I probably should have saved the Hail table rather than exporting it to a tsv.bgz. I was trying to minimize the number of files on disk (because I had 3,990 batches of 20 gwases each).

When I run hl.import_table on a tsv.bgz, the file imports fine and I have separate fields. The only issue is that beta is a “list-looking” string, but I need it to be a proper list of floats.

Eg, I have:

"[-0.004992247354191095,0.0026417819469886474,-0.014409962452715221,0.004777507278440492,-0.0056093847643026,0.0023188415398793264,-0.015489004864868155,-0.014825598498744404,-0.014305084189602574,-0.015404430056206888,-0.013617417612095348,-0.02693687459808479,-0.025542563554703378,-0.005741555262049462,0.00865281622141147,0.001511397760850175,0.005827099846636984,0.031529446318792724,0.007504775012500903,-0.012863855363026016]"

I want:

[-0.004992247354191095,0.0026417819469886474,-0.014409962452715221,0.004777507278440492,-0.0056093847643026,0.0023188415398793264,-0.015489004864868155,-0.014825598498744404,-0.014305084189602574,-0.015404430056206888,-0.013617417612095348,-0.02693687459808479,-0.025542563554703378,-0.005741555262049462,0.00865281622141147,0.001511397760850175,0.005827099846636984,0.031529446318792724,0.007504775012500903,-0.012863855363026016]

Ah, OK, so loading those files directly is actually quite straightforward. See below. Yeah, if you do this again you should use the Hail Table format rather than TSVs.

import hail as hl 
ht = hl.import_table('bar.tsv.gz')
ht = ht.select( 
    locus = hl.parse_locus(ht.locus), 
    rsid = ht.rsid, 
    alleles = hl.parse_json(ht.alleles, hl.tarray(hl.tstr)), 
    beta = hl.parse_json(ht.beta, hl.tarray(hl.tfloat)) 
)
ht = ht.key_by(ht.locus)

Yes, @danking, this importing code worked. Thank you! The data are now in the proper format.
The next step is to merge the tables.

For each table, ht.beta is an <ArrayNumericExpression of type array<float64>> that contains 20 numbers for each locus. I have 3,990 of these tables. Ultimately I need an array of 79,800 numbers (20 * 3,990) for the next step. So, something like:

main_ht = hl.import_table('table1.tsv.gz')
other_tables = ['table2.tsv.gz', 'table3.tsv.gz', 'table4.tsv.gz', etc...]

for tt in other_tables:
    temp_ht = hl.import_table(tt)
    main_ht = main_ht.annotate(beta = main_ht.beta + temp_ht[main_ht.locus].beta)

Or, rather than extending main_ht, maybe creating an empty table and progressively building on it in each iteration of the loop?

I have tried your suggestion:

ht = ht.annotate(beta = ht.beta + ht2[ht.locus].beta)

But it is not concatenating the betas from the second table:

len(ht.beta.take(1)[0])
20

len(ht2.beta.take(1)[0])
20

ht = ht.annotate(beta = ht.beta + ht2[ht.locus].beta)
len(ht.beta.take(1)[0])
20

EDIT: The following works

ht = ht.annotate(beta = ht.beta.extend(ht2[ht.locus].beta))
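That makes sense in hindsight: as I understand it, + on two numeric array expressions in Hail does element-wise addition (as in NumPy), so the length stays 20, while extend concatenates. A quick NumPy analogy of the two behaviors:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

# `+` adds element-wise: the result has the same length as the inputs.
elementwise = a + b

# Concatenation (the `extend` behavior) needs an explicit call.
concatenated = np.concatenate([a, b])
```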