Hail syntax for this is:
Hail syntax for this is:
Hi there. I’ve read in all of the bgen files and subset them using the above command. Now I just want to count the number of samples in the subset to make sure that I have all of the samples.
However, counting the subset matrix table is taking a really long time.
genotype_info = hl.import_bgen(bgen22Files, entry_fields=['dosage'], sample_file=sampleFile) genotype_info_subset = genotype_info.filter_cols(hl.literal(subjs).contains(genotype_info.s))
This works very quickly:
This is taking a really really long time:
Should I be overwriting
genotype_info? Maybe Hail did some indexing when I created the matrix table and it needs to reindex after I subset?
I’m reading in my idpsFile and I’m getting:
2021-04-08 11:37:31 Hail: INFO: Ordering unsorted dataset with network shuffle
What does this mean? The idpsFile is a 11,533 x 79,800 csv of floats. Here is an example entry:
It is essential that I keep the same ordering of this matrix. The rows and columns mean something very specific (rows are subjects, columns are brain functional connectivity edges). Please let me know if there is a better way to do this. Here is how I was thinking of setting this up:
s. The rows in
idpsFilecorrespond to the order of the sample text file (how would I “tag” the idpsFile matrix table so that the rows are associated with the sample IDs)?
This is probably a Hail performance bug wherein it doesn’t realize it need not look at the genotype data.
genotype_info_subset.count_cols() should be fast.
Does just the import-and-write operation cause this error? Concretely, does running only this show that message? If it does, that’s a Hail bug. Can you share a simplified file that triggers this?
import hail as hl pheno_mt = hl.import_matrix_table(idpsFile, no_header=True) pheno_mt.write('some_path.mt')
Thanks for your response.
count_cols() is much faster. Thank you.
Yes, just reading and writing causes the error. However, I also have to specify the dtype. If I do not I get:
Error summary: MatrixParseError: file:/data/Ukbb_DSST/DSST/dmoracze/gwas/head_idps.csv:2-3, invalid character '.' in integer literal
If I try:
pheno_mt = hl.import_matrix_table(idpsFile, no_header=True,entry_type=hl.tstr)
but it doesn’t error out. However the data are not strings, they are float (eg
-0.820223875951066), but it I try
entry_type=hl.tfloat64 I get a really long error beginning with:
FatalError: NumberFormatException: For input string:
In bash, I used
head to grab to top of the file to see if it would recreate the error. It doesn’t but it gives me:
2021-04-08 14:40:52 Hail: INFO: Coerced sorted dataset
The full file is 16GB in size. Do you think if I rounded those floats Hail would be happier?
I have attached the first 10 rows and 10 columns from the full
samp.txt (1.8 KB)
This command should not trigger that message because it is not an action (write, show, collect, etc.). Some subsequent operation is causing that message. What is the full series of commands that triggers that error? What version of Hail are you using? Latest is currently 0.2.64.
Can you try this instead?
import hail as hl mt = hl.import_matrix_table('samp.txt', no_header=True, entry_type=hl.tfloat, delimiter=',') mt.show()
If you get the number format error again, paste the whole thing here or at some paste service (like gist.github.com) and share the link here and I can dig in further.
EDIT: Do you have missing values? How are they represented?
hl.import_matrix_table uses the string
NA by default, but you can override that.
I’m using Hail 0.2.45.
Sorry, I wasn’t clear in the above message and it appears that I get the warning when I write the matrix table, not read it.
import hail as hl mt = hl.import_matrix_table('samp.txt', no_header=True, entry_type=hl.tfloat, delimiter=',')
This works, no errors or warnings, but
mt.show() gives me a whole host of errors. This is the bottom, I can paste the hundreds of lines in the error if you want.
Hail version: 0.2.45-a45a43f21e83 Error summary: SparkException: Failed to get broadcast_5_piece0 of broadcast_5
When I write the matrix table, I get the warning:
mt.write('samp.mt') 2021-04-08 15:29:41 Hail: INFO: Coerced sorted dataset 2021-04-08 15:29:41 Hail: INFO: wrote matrix table with 10 rows and 10 columns in 1 partition to samp.mt
You should update to hail 0.2.64. You’re missing out on 9 extra months of development, it’s possible things have improved and some of your problems have been addressed.
Oof! That’s almost a year old. Would you mind updating to the latest version of Hail and trying again?
The broadcast issue is perplexing, I’d need the complete stack trace to address that. It looks likely to be some spark configuration issue.
EDIT: it looks like the write was successful anyway.
Thanks. I need to put this project down for today. I’ll update and try tomorrow.
Yes, I just want to make sure that the order in which it was written was the order it was read because the row order is tied to the samples. So I wanted to know whether are ordering or sorting was happening.
I’m quite confident this snippet changes neither the row nor column ordering:
import hail as hl mt = hl.import_matrix_table('samp.txt', no_header=True, entry_type=hl.tfloat, delimiter=',') mt.write('samp.mt')
If it does, that’d be a serious error in Hail that I’d like to investigate further.
Hello - I’m trying to incorporate the method outlined by @tpoterba above:
However saving the subset genotype data as a matrix table is taking a really long time.
genotype_info = hl.import_bgen(bgen22Files, entry_fields=['dosage'], sample_file=sampleFile) genotype_info_subset = genotype_info.filter_cols(hl.literal(subjs).contains(genotype_info.s)) genotype_info_subset.write('geno_subset.mt')
bgen22Files is a list of all the bgen files for the 22 chromosomes. I’m subsetting by
subjs which is a list of 11,533 sample IDs. The
genotype_info_subset.write('geno_subset.mt') call has been running for almost 24 hours.
Is it necessary to save the subset genotype data as a matrix table? Or should I just merge the pheno and the geno data and move on, not saving it?
Another question that I’m going to have: how should I optimize the gwas to run on my HPC? I’ve got 79,800 gwases to run. Should I manually distribute these jobs using our job scheduler or is @tpoterba’s suggestion of working in batches the most efficient way to go about this?
The UKB BGEN files are very large and selecting out 11k of 500k columns is a time-consuming operation. It is not necessary to save the genotype data as a matrix table. BGEN an efficient file format, I recommend using it directly.
I don’t think there’s any optimization that you can do. Hail could be faster (we’re working on it!), but Hail was primarily designed to scale, on a Spark cluster, to many cores. That’s how most folks use Hail on the UKB: they use a very large number of cores for a short period of time to quickly finish an analysis. Because Hail scales well, the cost is roughly the same whether you use 1 core for 10,000 hours or 10,000 cores for 1 hour.
EDIT: 79,800 GWASes on every variant in the BGEN for 11,000 is a tremendous amount of computation. The Neale Lab ran ~2k GWASes, on every variant, on 337,000 individuals, and it took a couple days with thousands of cores. The analysis you want to do is certainly possible, but I think only in an environment where you can lease thousands of cores for a couple of days.
Thanks for the info! I’m not worried too much about compute time, I just want to make sure I’m preparing in the most efficient way as possible. I’ll be working on the Biowulf cluster at the NIH.
EDIT: Once I know the most efficient way to use Hail, then I’ll benchmark the jobs and I can work with the cluster support staff to efficiently request resources.
I’m having an issue in my process of subsetting and merging the genotype and phenotype data.
# phenotype # 11533 x 79800 csv pheno_mt = hl.import_matrix_table(idpsFile, no_header=True, entry_type=hl.tfloat, delimiter=',') subjs # list of 11533 strings of subject IDs (samples) # genotype genotype_info = hl.import_bgen(bgen22Files, entry_fields=['dosage'], sample_file=sampleFile) genotype_info_subset = genotype_info.filter_cols(hl.literal(subjs).contains(genotype_info.s))
When I subset
genotype_info based on
subjs, only 11224 samples are returned. I take it this means I don’t have genotype data for some of the subjects I have phenotype data for. That’s fine.
However, I want to merge these by sample /
pheno_mt are separate I’m not sure how to perform this merge. Ideally I’d have the row_key for
I’ve tried to use
genotype_info_subset.s to subset
subjs[genotype_info_subset.s] TypeError: list indices must be integers or slices, not StringExpression
I’ve tried to annotate columns of
subjs but this puts the entire subject list at each row. And I’ve tried to replace the row key of
TypeError: key_cols_by: parameter '*keys' (arg 0 of 1): expected (str or hail.expr.expressions.base_expression.Expression), found list:
@danking since I’m wanting to run 79,800 separate gwases, distributed across our cluster. Which approach do you think is more effeicient?
Either way I’ll need a way to preserve the pheno column index, so I’ll likely write a script that takes a column index as an argument.
I think you want option 3 –
Ah good call, thanks.
You suggested this earlier in the thread, sorry to make you repeat yourself!
@tpoterba your method seems to be working well, though saving the results takes a really really long time. Is that to be expected?
batch_results = hl.linear_regression_rows( [geno_subset_mt.phenos[i] for i in range(batch_size)], geno_subset_mt.dosage, covariates=[1.0], block_size=128) batch_results.write('test') # takes a really long time
Yes, absolutely! Hail’s interfaces are lazy; no computational work is happening until your
write line. The time for the
write includes the time for the regressions (which is the majority of the work).