Requesting advice on efficiently parsing through many GWAS results

Hello - I appreciate your previous advice on my gwas project. After a couple of months of compute, I finally have the results. However, I now need to parse and reduce the results' dimensionality in a specific way. My current method would likely take another couple of months of compute, so I'm asking for advice.

I have 11.5k samples and 79,800 independent gwas results. It was suggested to me to batch my gwases, which I did: I grouped the gwas calculations into groups of 20, so now I have 3,990 Hail tables of results. These results were saved as tsv.bgz files using the .export() method.

Here is the format of an example results table:

locus	alleles	rsid	beta
str	str	str	str
"1:10177"	"["A","AC"]"	"rs367896724"	"[-0.004992247354191095,0.0026417819469886474,-0.014409962452715221,0.004777507278440492,-0.0056093847643026,0.0023188415398793264,-0.015489004864868155,-0.014825598498744404,-0.014305084189602574,-0.015404430056206888,-0.013617417612095348,-0.02693687459808479,-0.025542563554703378,-0.005741555262049462,0.00865281622141147,0.001511397760850175,0.005827099846636984,0.031529446318792724,0.007504775012500903,-0.012863855363026016]"
"1:10235"	"["T","TA"]"	"rs540431307"	"[0.989170870829994,0.45589529664586437,0.07456373287634475,0.05167066711118691,-0.3496979091278656,0.4011920905498073,-0.3291856676880175,-0.01002863257549891,0.25749922562854016,-0.2566591179163784,0.44599895972103737,-0.310441194598682,0.8020505230033537,-0.2752300904721687,0.8214149187108155,-0.03920455900499864,0.3603526914002074,0.9642902631203855,0.5036438278449906,0.22365514191935543]"
"1:10352"	"["T","TA"]"	"rs201106462"	"[-0.0020278618338467015,-0.004461260940655055,-0.026189263875130746,0.005355074856952288,0.011523131347578843,-0.007970166278363253,0.003296571384735143,-0.00649396809716581,0.014300983043055285,4.40261684968913E-4,0.003538993449066244,-8.286792318639081E-4,-0.020481793463538518,0.008636489557145695,0.00710222047012649,0.014516439256649201,0.014693430140656416,0.005001212977414611,0.016549539697944978,0.006061780447752583]"
"1:10511"	"["G","A"]"	"rs534229142"	"[-0.22979288348177693,-0.63180137240308,-0.2642197661671302,-0.3405353460261729,-0.2246034835820954,-0.13901193871683573,-0.3569940277656782,-0.0909523045346519,0.13585679735890965,0.024173136608572173,-0.3799387952000725,0.15860955045867584,-0.19425315063369158,0.014375059046760038,-0.21189394358371727,-0.28794279936083406,-0.24285737762283013,-0.22134029471797062,-0.18501834293857816,0.021753184745015277]"
"1:10539"	"["C","A"]"	"rs537182016"	"[-0.24804359262513834,-0.7466315459222219,-0.20205947593830098,-0.6517927032349672,-0.37240926870596797,-0.20646940743643485,-0.46798830905964944,-0.40397322572818833,0.29269677247906517,-0.8018879297190337,-1.2039720781574579,-0.8974650971076943,0.24981530667688068,-0.5484618372137677,0.9445202463042509,-0.22211340995039952,-0.053686820149640525,0.23281955836171025,-0.43378366289313103,-0.9586143160806833]"
"1:10542"	"["C","T"]"	"rs572818783"	"[30.939719605803663,48.783415439832766,6.739350041755941,61.189494422123985,9.726067145889424,21.799292474564655,22.261921133117312,16.232189172074975,-9.744005641630016,48.26832943863782,11.98306878858923,-6.821133363834209,-11.593961173560622,47.550041474034096,13.846639235703389,36.026027215365744,-9.418893236891918,19.65837983385381,33.95421687385974,83.3137124899285]"
"1:10616"	"["CCGCCGTTGCAAAGGCGCGCCG","C"]"	"1:10616_CCGCCGTTGCAAAGGCGCGCCG_C"	"[0.1572195636635454,-0.03380466544471702,0.10008297890229777,0.16838572990469847,-0.06865734474318705,0.09650852382583495,-0.050156361176896494,0.047417674622061416,-0.10872370770773801,0.06152917138145618,-0.17989300833614524,-0.009355772096424981,-0.050177672665647974,0.051020923197199024,-0.014341393681502381,-0.12347218381418407,-0.1675451785660034,-0.17914536174644388,-0.2511418897114047,0.08425392120678984]"
"1:10642"	"["G","A"]"	"rs558604819"	"[0.05569764017945619,0.37420858922910666,-0.7208325550938323,-0.1916126970575322,-0.8340579896021219,-0.6570988531363929,0.5198944261836436,-0.07383483925776889,0.3513901504802878,0.2677412534025106,-0.5468265382140379,-0.48199159768215777,-1.1770404732166548,-1.1885635716955438,-0.3188556174242601,0.1817088667146935,-0.6415826723687796,0.21262382922173031,-0.004160434941234663,-0.023916588391776874]"
"1:11008"	"["C","G"]"	"rs575272151"	"[0.01242166814237488,0.03263509520576859,0.0029011642783888466,0.01925529976620068,0.03227821661755542,0.021883418635852116,0.06437957339363902,6.992602915144286E-4,0.027012784894192243,-0.020915543585543827,0.03932005240599367,0.013229880666337143,0.031846916340306015,0.016521620844440307,0.0050513889209964245,0.017229061932873962,-0.009992510637037553,0.008859708896325202,-0.015770399904468206,0.0240271472515886]"
"1:11012"	"["C","G"]"	"rs544419019"	"[0.01242166814237488,0.03263509520576859,0.0029011642783888466,0.01925529976620068,0.03227821661755542,0.021883418635852116,0.06437957339363902,6.992602915144286E-4,0.027012784894192243,-0.020915543585543827,0.03932005240599367,0.013229880666337143,0.031846916340306015,0.016521620844440307,0.0050513889209964245,0.017229061932873962,-0.009992510637037553,0.008859708896325202,-0.015770399904468206,0.0240271472515886]"

The issue is that the 79,800 gwases are not completely independent. We have 400 brain regions and a 400 x 400 connectivity matrix, so 79,800 unique brain connections. I now need to average the betas to get a mean beta for each region for each rsid. The problem is that the gwas results are distributed among 3,990 tables.

My first attempt was: for each rsid, loop through the tables, filter, and append the results to a list. So, for each rsid, I'll have a list of 79,800 beta values that I can then average for each brain region.

import hail as hl
import pickle

hl.init()

rsid = 'rs367896724'

betas = []
for i in range(3990):
    print(f'gwas_betas_idx{i}_step20.tsv.bgz')
    file = f'/data/gwas/hail-tsvs/tsvs/gwas_betas_idx{i}_step20.tsv.bgz'
    data = hl.import_table(file)
    rs_data = data.filter(data.rsid == rsid)
    rs_struct = rs_data.take(1)
    # beta is a string that looks like a list; strip the brackets and parse floats
    lst = [float(x) for x in rs_struct[0].beta[1:-1].split(',')]
    betas.extend(lst)

with open(f'/data/gwas/hail-tsvs/rsid-betas/{rsid}.pkl', 'wb') as f:
    pickle.dump(betas, f)

This works, but it takes about a minute to run for each table, so for one rsid I'm looking at ~2.75 days of processing. My next attempt will be to distribute this in some way, probably distributing at the rsid level since I have so many of them, but I wanted to get your advice first.

Thank you in advance!

It’s possible to import all the files as a single Hail Table with hl.import_table. You can use the source_file_field='filename' argument to get the input file as a field of the table as well, if you need. You could do something like:

ht = hl.import_table([all_the_tables], ...)
ht2 = ht.group_by(ht.rsid).aggregate(mean_beta = hl.agg.explode(lambda x: hl.agg.mean(x), ht.beta[1:-1].split(',').map(hl.float)))


Thanks, @tpoterba, I’ve tried this a few times. The general idea of grouping and aggregating all of the tables into one hail table makes sense. However, I’m having some difficulty figuring out how to implement what I want.

See above for an example of one of the tables. Note that the column beta is a str that looks like a list:

"[-0.004992247354191095,0.0026417819469886474,-0.014409962452715221,0.004777507278440492,-0.0056093847643026,0.0023188415398793264,-0.015489004864868155,-0.014825598498744404,-0.014305084189602574,-0.015404430056206888,-0.013617417612095348,-0.02693687459808479,-0.025542563554703378,-0.005741555262049462,0.00865281622141147,0.001511397760850175,0.005827099846636984,0.031529446318792724,0.007504775012500903,-0.012863855363026016]"

I need to do two things with this:

  1. Convert the str to a list of floats
  2. Group the tables by rsid, then concatenate the lists. The order of the floats within each list and the order of the lists are important.

What I’m hoping for is that for each rsid I’ll have a list of all betas (79,800) in order, so that I can then run something like the following code:

i, j = np.triu_indices(400, k=1)
M = np.zeros((400, 400))
M[i, j] = list_data
M[j, i] = list_data
np.mean(M, axis=0)

Essentially I’m taking the 79,800 floats and projecting them into a symmetric 400x400 matrix so that I can take the mean of each column (or row, since it’s symmetric).

Do you have an idea of how I can implement any of the steps above in hail?

Hello - I’m following up here.

This is what my tables look like:

locus	alleles	rsid	beta
str	str	str	str
"1:10177"	"["A","AC"]"	"rs367896724"	"[-0.004992247354191095,0.0026417819469886474,-0.014409962452715221,0.004777507278440492,-0.0056093847643026,0.0023188415398793264,-0.015489004864868155,-0.014825598498744404,-0.014305084189602574,-0.015404430056206888,-0.013617417612095348,-0.02693687459808479,-0.025542563554703378,-0.005741555262049462,0.00865281622141147,0.001511397760850175,0.005827099846636984,0.031529446318792724,0.007504775012500903,-0.012863855363026016]"
"1:10235"	"["T","TA"]"	"rs540431307"	"[0.989170870829994,0.45589529664586437,0.07456373287634475,0.05167066711118691,-0.3496979091278656,0.4011920905498073,-0.3291856676880175,-0.01002863257549891,0.25749922562854016,-0.2566591179163784,0.44599895972103737,-0.310441194598682,0.8020505230033537,-0.2752300904721687,0.8214149187108155,-0.03920455900499864,0.3603526914002074,0.9642902631203855,0.5036438278449906,0.22365514191935543]"
"1:10352"	"["T","TA"]"	"rs201106462"	"[-0.0020278618338467015,-0.004461260940655055,-0.026189263875130746,0.005355074856952288,0.011523131347578843,-0.007970166278363253,0.003296571384735143,-0.00649396809716581,0.014300983043055285,4.40261684968913E-4,0.003538993449066244,-8.286792318639081E-4,-0.020481793463538518,0.008636489557145695,0.00710222047012649,0.014516439256649201,0.014693430140656416,0.005001212977414611,0.016549539697944978,0.006061780447752583]"
"1:10511"	"["G","A"]"	"rs534229142"	"[-0.22979288348177693,-0.63180137240308,-0.2642197661671302,-0.3405353460261729,-0.2246034835820954,-0.13901193871683573,-0.3569940277656782,-0.0909523045346519,0.13585679735890965,0.024173136608572173,-0.3799387952000725,0.15860955045867584,-0.19425315063369158,0.014375059046760038,-0.21189394358371727,-0.28794279936083406,-0.24285737762283013,-0.22134029471797062,-0.18501834293857816,0.021753184745015277]"
"1:10539"	"["C","A"]"	"rs537182016"	"[-0.24804359262513834,-0.7466315459222219,-0.20205947593830098,-0.6517927032349672,-0.37240926870596797,-0.20646940743643485,-0.46798830905964944,-0.40397322572818833,0.29269677247906517,-0.8018879297190337,-1.2039720781574579,-0.8974650971076943,0.24981530667688068,-0.5484618372137677,0.9445202463042509,-0.22211340995039952,-0.053686820149640525,0.23281955836171025,-0.43378366289313103,-0.9586143160806833]"
"1:10542"	"["C","T"]"	"rs572818783"	"[30.939719605803663,48.783415439832766,6.739350041755941,61.189494422123985,9.726067145889424,21.799292474564655,22.261921133117312,16.232189172074975,-9.744005641630016,48.26832943863782,11.98306878858923,-6.821133363834209,-11.593961173560622,47.550041474034096,13.846639235703389,36.026027215365744,-9.418893236891918,19.65837983385381,33.95421687385974,83.3137124899285]"
"1:10616"	"["CCGCCGTTGCAAAGGCGCGCCG","C"]"	"1:10616_CCGCCGTTGCAAAGGCGCGCCG_C"	"[0.1572195636635454,-0.03380466544471702,0.10008297890229777,0.16838572990469847,-0.06865734474318705,0.09650852382583495,-0.050156361176896494,0.047417674622061416,-0.10872370770773801,0.06152917138145618,-0.17989300833614524,-0.009355772096424981,-0.050177672665647974,0.051020923197199024,-0.014341393681502381,-0.12347218381418407,-0.1675451785660034,-0.17914536174644388,-0.2511418897114047,0.08425392120678984]"
"1:10642"	"["G","A"]"	"rs558604819"	"[0.05569764017945619,0.37420858922910666,-0.7208325550938323,-0.1916126970575322,-0.8340579896021219,-0.6570988531363929,0.5198944261836436,-0.07383483925776889,0.3513901504802878,0.2677412534025106,-0.5468265382140379,-0.48199159768215777,-1.1770404732166548,-1.1885635716955438,-0.3188556174242601,0.1817088667146935,-0.6415826723687796,0.21262382922173031,-0.004160434941234663,-0.023916588391776874]"
"1:11008"	"["C","G"]"	"rs575272151"	"[0.01242166814237488,0.03263509520576859,0.0029011642783888466,0.01925529976620068,0.03227821661755542,0.021883418635852116,0.06437957339363902,6.992602915144286E-4,0.027012784894192243,-0.020915543585543827,0.03932005240599367,0.013229880666337143,0.031846916340306015,0.016521620844440307,0.0050513889209964245,0.017229061932873962,-0.009992510637037553,0.008859708896325202,-0.015770399904468206,0.0240271472515886]"
"1:11012"	"["C","G"]"	"rs544419019"	"[0.01242166814237488,0.03263509520576859,0.0029011642783888466,0.01925529976620068,0.03227821661755542,0.021883418635852116,0.06437957339363902,6.992602915144286E-4,0.027012784894192243,-0.020915543585543827,0.03932005240599367,0.013229880666337143,0.031846916340306015,0.016521620844440307,0.0050513889209964245,0.017229061932873962,-0.009992510637037553,0.008859708896325202,-0.015770399904468206,0.0240271472515886]"

I need to:

  1. convert beta to a list of floats.
    I’ve gotten as far as converting to a list of strings: data.beta[1:-1].split(','), but how do I make this field a list of floats instead of strings? I feel like there must be a more efficient way than using a list comprehension on every single row, and even then I’m not sure that’s possible in Hail.

  2. I need to group many tables by rsid and extend beta accordingly.
    That is, I have table1, table2, table3, etc., all of which contain the same rsids but a different set of 20 betas. Ultimately I need one long list per rsid.

  3. Once I have the long list of numbers, I have a function I need to apply to them. Essentially,

i, j = np.triu_indices(400, k=1)
M = np.zeros((400, 400))
M[i, j] = betas
M[j, i] = betas
np.mean(M, axis=0)

Here, I’m projecting a list of 79,800 values into a 400 x 400 matrix and taking the mean of each row. Ultimately I need a list of 400 floats per rsid.
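As a sanity check for step 3, here's a toy version with made-up numbers, using n = 4 regions (6 upper-triangle values) in place of 400 regions (79,800 values); the code path is identical to the real case:

```python
import numpy as np

# Toy version of step 3: n = 4 regions -> 6 upper-triangle values
# (the real case is n = 400 -> 79,800 values; same code path).
n = 4
betas = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

i, j = np.triu_indices(n, k=1)
M = np.zeros((n, n))
M[i, j] = betas
M[j, i] = betas

# Column means; equal to row means because M is symmetric.
region_means = M.mean(axis=0)
```

One thing to note: mean(axis=0) divides by n and counts the zero diagonal as a value; if you want the average over only the n - 1 real connections per region, use M.sum(axis=0) / (n - 1) instead.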

Help on any of the above steps would be great. I’ve been stuck for a couple of weeks.

Eep, did Hail generate this file? We should fix our exporters if that’s the case. The "["A","AC"]" looks like a real PITA to parse.

  1. I think you want this:
import hail as hl 
ht = hl.import_table('bar.tsv', find_replace=("\"(\t)\"", "\t"), filter='str\tstr\tstr\tstr') 
ht = ht.select( 
    locus = hl.parse_locus(ht.locus[1:]), 
    rsid = ht.rsid, 
    alleles = hl.parse_json(ht.alleles, hl.tarray(hl.tstr)), 
    beta = hl.parse_json(ht.beta[:-1], hl.tarray(hl.tfloat)) 
)
ht = ht.key_by(ht.locus)
  2. The quick and dirty solution is to join these tables pairwise. You’ll probably want to do a binary tree of joins. We have faster ways to do this, but they’re more complicated; try this first and see if it’s fast enough. Here’s an example for joining two tables. You’ll want to do this recursively: join the 3,990 files down to 1,995, then those down to 998, and so on, until only a handful remain that you can join explicitly.
ht = ht.annotate(beta = ht.beta + ht2[ht.locus].beta)
  3. I’m not sure I follow this upper-triangular thing. If you want the mean of each row, then do this:
ht = ht.annotate(mean = hl.mean(ht.beta))
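The binary tree of joins can be sketched like this. This is a hypothetical helper (tree_join and combine are my names, not Hail API); combine stands in for one two-table Hail join, e.g. lambda l, r: l.annotate(beta=l.beta.extend(r[l.locus].beta)) (ArrayExpression.extend concatenates arrays):

```python
# Hypothetical sketch of a binary tree of joins; `combine` is one
# two-table join. Each round halves the number of tables, so the
# whole merge takes ~log2(n) rounds.
def tree_join(tables, combine):
    while len(tables) > 1:
        nxt = [combine(tables[k], tables[k + 1])
               for k in range(0, len(tables) - 1, 2)]
        if len(tables) % 2 == 1:
            nxt.append(tables[-1])  # odd table carries over to the next round
        tables = nxt
    return tables[0]

# Toy demo with plain lists standing in for per-table beta arrays:
merged = tree_join([[1], [2], [3], [4], [5]], lambda a, b: a + b)
```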

EDIT: Sorry, I should have clarified. The table I pasted was the Python output of ht.show(), not the raw contents of the file itself.

Yes, Hail created these files, but it’s likely that the issue had to do with my specific use (and maybe error). I was running batches of 20 gwases at a time.

batch_results = hl.linear_regression_rows(
    [geno_subset_mt.phenos[i] for i in range(batch_size)],
    geno_subset_mt.dosage,
    covariates=[1.0],
    pass_through=[geno_subset_mt.rsid, geno_subset_mt.mfi.chrom],
    block_size=128)

# save the results
res = batch_results.select('rsid', 'beta')
res = res.filter(hl.is_nan(res.beta[0]), keep=False)
res.export(f'/gwas/tsvs/gwas_betas_idx{settings["idx"]}_step{settings["step"]}.tsv.bgz')

In hindsight, I probably should have saved the Hail table rather than exporting it to a tsv.bgz. I was trying to minimize the number of files on disk (because I had 3,990 batches of 20 gwases each).

When I run hl.import_table on a tsv.bgz, the file imports fine and I have separate fields. The only issue is that beta is a “list-looking” string, but I need it to be a proper list of floats.

Eg, I have:

"[-0.004992247354191095,0.0026417819469886474,-0.014409962452715221,0.004777507278440492,-0.0056093847643026,0.0023188415398793264,-0.015489004864868155,-0.014825598498744404,-0.014305084189602574,-0.015404430056206888,-0.013617417612095348,-0.02693687459808479,-0.025542563554703378,-0.005741555262049462,0.00865281622141147,0.001511397760850175,0.005827099846636984,0.031529446318792724,0.007504775012500903,-0.012863855363026016]"

I want:

[-0.004992247354191095,0.0026417819469886474,-0.014409962452715221,0.004777507278440492,-0.0056093847643026,0.0023188415398793264,-0.015489004864868155,-0.014825598498744404,-0.014305084189602574,-0.015404430056206888,-0.013617417612095348,-0.02693687459808479,-0.025542563554703378,-0.005741555262049462,0.00865281622141147,0.001511397760850175,0.005827099846636984,0.031529446318792724,0.007504775012500903,-0.012863855363026016]

Ah, OK, so loading those files directly is actually quite straightforward. See below. Yeah, if you do this again you should use the Hail Table format rather than TSVs.

import hail as hl 
ht = hl.import_table('bar.tsv.gz')
ht = ht.select( 
    locus = hl.parse_locus(ht.locus), 
    rsid = ht.rsid, 
    alleles = hl.parse_json(ht.alleles, hl.tarray(hl.tstr)), 
    beta = hl.parse_json(ht.beta, hl.tarray(hl.tfloat)) 
)
ht = ht.key_by(ht.locus)

Yes, @danking, this importing code worked. Thank you! The data are now in the proper format.
The next step is to merge the tables.

For each table, ht.beta is an <ArrayNumericExpression of type array<float64>> that contains 20 numbers for each locus. I have 3,990 of these tables. Ultimately I need an array of 79,800 numbers (20 * 3,990) for the next step. So, something like:

main_ht = hl.import_table('table1.tsv.gz')
other_tables = ['table2.tsv.gz', 'table3.tsv.gz', 'table4.tsv.gz', etc...]

for tt in other_tables:
    temp_ht = hl.import_table(tt)
    main_ht = main_ht.annotate(beta = main_ht.beta + temp_ht[main_ht.locus].beta)

Or, rather than extending main_ht, maybe creating an empty table and progressively building on it in each iteration of the loop?

I have tried your suggestion:

ht = ht.annotate(beta = ht.beta + ht2[ht.locus].beta)

But it is not concatenating the betas from the second table:

len(ht.beta.take(1)[0])
20

len(ht2.beta.take(1)[0])
20

ht = ht.annotate(beta = ht.beta + ht2[ht.locus].beta)
len(ht.beta.take(1)[0])
20

EDIT: The following works

ht = ht.annotate(beta = ht.beta.extend(ht2[ht.locus].beta))
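That makes sense in hindsight: as I understand it, + on two numeric array expressions in Hail does element-wise addition (as in NumPy), so the length stays 20, while extend concatenates. A quick NumPy analogy of the two behaviors:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

# `+` adds element-wise: the result has the same length as the inputs.
elementwise = a + b

# Concatenation (the `extend` behavior) needs an explicit call.
concatenated = np.concatenate([a, b])
```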