Hello - I appreciate your previous advice on my GWAS project. After a couple of months of compute, I finally have the results. However, I now need to parse the results and reduce their dimensionality in a specific way, and my current method would likely take another couple of months of compute, so I'm asking for advice.
I have 11.5k samples and 79,800 independent GWAS results. It was suggested to me to batch my GWASes, which I did: I grouped the GWAS calculations into batches of 20, so I now have 3,990 Hail tables of results. These were saved as tsv.bgz files using the .export() method.
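For reference, each batch table was written out roughly like this (the table path and batch index below are just illustrative, not my actual pipeline variables):

import hail as hl

hl.init()

# illustrative re-export of one batch table (20 GWASes per table); the .ht path is hypothetical
batch_ht = hl.read_table('/data/gwas/hail-tables/batch_0.ht')
batch_ht.export('/data/gwas/hail-tsvs/tsvs/gwas_betas_idx0_step20.tsv.bgz')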
Here is the format of an example results table:
locus      alleles    rsid           beta
str        str        str            str
"1:10177" "["A","AC"]" "rs367896724" "[-0.004992247354191095,0.0026417819469886474,-0.014409962452715221,0.004777507278440492,-0.0056093847643026,0.0023188415398793264,-0.015489004864868155,-0.014825598498744404,-0.014305084189602574,-0.015404430056206888,-0.013617417612095348,-0.02693687459808479,-0.025542563554703378,-0.005741555262049462,0.00865281622141147,0.001511397760850175,0.005827099846636984,0.031529446318792724,0.007504775012500903,-0.012863855363026016]"
"1:10235" "["T","TA"]" "rs540431307" "[0.989170870829994,0.45589529664586437,0.07456373287634475,0.05167066711118691,-0.3496979091278656,0.4011920905498073,-0.3291856676880175,-0.01002863257549891,0.25749922562854016,-0.2566591179163784,0.44599895972103737,-0.310441194598682,0.8020505230033537,-0.2752300904721687,0.8214149187108155,-0.03920455900499864,0.3603526914002074,0.9642902631203855,0.5036438278449906,0.22365514191935543]"
"1:10352" "["T","TA"]" "rs201106462" "[-0.0020278618338467015,-0.004461260940655055,-0.026189263875130746,0.005355074856952288,0.011523131347578843,-0.007970166278363253,0.003296571384735143,-0.00649396809716581,0.014300983043055285,4.40261684968913E-4,0.003538993449066244,-8.286792318639081E-4,-0.020481793463538518,0.008636489557145695,0.00710222047012649,0.014516439256649201,0.014693430140656416,0.005001212977414611,0.016549539697944978,0.006061780447752583]"
"1:10511" "["G","A"]" "rs534229142" "[-0.22979288348177693,-0.63180137240308,-0.2642197661671302,-0.3405353460261729,-0.2246034835820954,-0.13901193871683573,-0.3569940277656782,-0.0909523045346519,0.13585679735890965,0.024173136608572173,-0.3799387952000725,0.15860955045867584,-0.19425315063369158,0.014375059046760038,-0.21189394358371727,-0.28794279936083406,-0.24285737762283013,-0.22134029471797062,-0.18501834293857816,0.021753184745015277]"
"1:10539" "["C","A"]" "rs537182016" "[-0.24804359262513834,-0.7466315459222219,-0.20205947593830098,-0.6517927032349672,-0.37240926870596797,-0.20646940743643485,-0.46798830905964944,-0.40397322572818833,0.29269677247906517,-0.8018879297190337,-1.2039720781574579,-0.8974650971076943,0.24981530667688068,-0.5484618372137677,0.9445202463042509,-0.22211340995039952,-0.053686820149640525,0.23281955836171025,-0.43378366289313103,-0.9586143160806833]"
"1:10542" "["C","T"]" "rs572818783" "[30.939719605803663,48.783415439832766,6.739350041755941,61.189494422123985,9.726067145889424,21.799292474564655,22.261921133117312,16.232189172074975,-9.744005641630016,48.26832943863782,11.98306878858923,-6.821133363834209,-11.593961173560622,47.550041474034096,13.846639235703389,36.026027215365744,-9.418893236891918,19.65837983385381,33.95421687385974,83.3137124899285]"
"1:10616" "["CCGCCGTTGCAAAGGCGCGCCG","C"]" "1:10616_CCGCCGTTGCAAAGGCGCGCCG_C" "[0.1572195636635454,-0.03380466544471702,0.10008297890229777,0.16838572990469847,-0.06865734474318705,0.09650852382583495,-0.050156361176896494,0.047417674622061416,-0.10872370770773801,0.06152917138145618,-0.17989300833614524,-0.009355772096424981,-0.050177672665647974,0.051020923197199024,-0.014341393681502381,-0.12347218381418407,-0.1675451785660034,-0.17914536174644388,-0.2511418897114047,0.08425392120678984]"
"1:10642" "["G","A"]" "rs558604819" "[0.05569764017945619,0.37420858922910666,-0.7208325550938323,-0.1916126970575322,-0.8340579896021219,-0.6570988531363929,0.5198944261836436,-0.07383483925776889,0.3513901504802878,0.2677412534025106,-0.5468265382140379,-0.48199159768215777,-1.1770404732166548,-1.1885635716955438,-0.3188556174242601,0.1817088667146935,-0.6415826723687796,0.21262382922173031,-0.004160434941234663,-0.023916588391776874]"
"1:11008" "["C","G"]" "rs575272151" "[0.01242166814237488,0.03263509520576859,0.0029011642783888466,0.01925529976620068,0.03227821661755542,0.021883418635852116,0.06437957339363902,6.992602915144286E-4,0.027012784894192243,-0.020915543585543827,0.03932005240599367,0.013229880666337143,0.031846916340306015,0.016521620844440307,0.0050513889209964245,0.017229061932873962,-0.009992510637037553,0.008859708896325202,-0.015770399904468206,0.0240271472515886]"
"1:11012" "["C","G"]" "rs544419019" "[0.01242166814237488,0.03263509520576859,0.0029011642783888466,0.01925529976620068,0.03227821661755542,0.021883418635852116,0.06437957339363902,6.992602915144286E-4,0.027012784894192243,-0.020915543585543827,0.03932005240599367,0.013229880666337143,0.031846916340306015,0.016521620844440307,0.0050513889209964245,0.017229061932873962,-0.009992510637037553,0.008859708896325202,-0.015770399904468206,0.0240271472515886]"
The issue is that the 79,800 GWASes are not completely independent. We have 400 brain regions and a 400 x 400 connectivity matrix, which gives 79,800 unique brain connections (400 x 399 / 2). I now need to average the betas so that, for each rsid, I get an average beta per region. The complication is that the GWAS results are spread across the 3,990 tables.
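For concreteness, this is roughly how I picture the reduction for a single rsid once I have its full vector of betas. I'm assuming here that the 79,800 values line up with the upper triangle of the 400 x 400 connectivity matrix; that ordering is an assumption for the sketch, not something guaranteed by my pipeline:

import numpy as np

N_REGIONS = 400
n_connections = N_REGIONS * (N_REGIONS - 1) // 2  # 79,800

# placeholder data standing in for the 79,800 betas collected for one rsid
betas_for_rsid = np.random.randn(n_connections)

# scatter the connection-level betas into a symmetric 400 x 400 matrix
iu, ju = np.triu_indices(N_REGIONS, k=1)
beta_mat = np.zeros((N_REGIONS, N_REGIONS))
beta_mat[iu, ju] = betas_for_rsid
beta_mat[ju, iu] = betas_for_rsid

# average beta per region = mean over the 399 connections touching that region
region_means = beta_mat.sum(axis=1) / (N_REGIONS - 1)  # shape (400,)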
My first attempt was: for each rsid, loop through the tables, filter to that rsid, and append the betas to a list. That way, for each rsid, I end up with a list of 79,800 beta values that I can then average per brain region.
import hail as hl
import pickle

hl.init()

rsid = 'rs367896724'
betas = []
for i in range(3990):
    print(f'gwas_betas_idx{i}_step20.tsv.bgz')
    file = f'/data/gwas/hail-tsvs/tsvs/gwas_betas_idx{i}_step20.tsv.bgz'
    data = hl.import_table(file)
    # keep only the row for this rsid and pull it back into Python
    rs_data = data.filter(data.rsid == rsid)
    rs_struct = rs_data.take(1)
    # beta is a stringified list of 20 floats, so strip the brackets and split
    lst = [float(x) for x in rs_struct[0].beta[1:-1].split(',')]
    betas.extend(lst)

with open(f'/data/gwas/hail-tsvs/rsid-betas/{rsid}.pkl', 'wb') as f:
    pickle.dump(betas, f)
This works, but it takes about a minute per table, so a single rsid is looking at ~2.75 days of processing (3,990 tables x ~1 minute each). My next attempt will be to distribute this in some way, probably at the rsid level since I have so many of them, but I wanted to get your advice first.
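Roughly what I have in mind for the distributed version is sketched below, using multiprocessing with pandas instead of Hail inside each worker. This is untested: I'm assuming the .bgz files can be read as ordinary gzip, and the worker count and rsid list are placeholders.

import csv
import pickle
import pandas as pd
from multiprocessing import Pool

N_TABLES = 3990

def collect_betas(rsid):
    # one worker per rsid: scan every batch table and pull that rsid's 20 betas
    betas = []
    for i in range(N_TABLES):
        path = f'/data/gwas/hail-tsvs/tsvs/gwas_betas_idx{i}_step20.tsv.bgz'
        # .bgz is block-gzip, which I believe plain gzip decompression can read;
        # quoting disabled so the bracketed beta strings don't confuse the parser
        df = pd.read_csv(path, sep='\t', compression='gzip',
                         quoting=csv.QUOTE_NONE, usecols=['rsid', 'beta'])
        beta_str = df.loc[df['rsid'] == rsid, 'beta'].iloc[0]
        betas.extend(float(x) for x in beta_str.strip('[]').split(','))
    with open(f'/data/gwas/hail-tsvs/rsid-betas/{rsid}.pkl', 'wb') as f:
        pickle.dump(betas, f)

if __name__ == '__main__':
    rsids = ['rs367896724']  # placeholder; the full list of rsids would go here
    with Pool(processes=8) as pool:  # worker count is illustrative
        pool.map(collect_betas, rsids)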
Thank you in advance!