Hail syntax for this is:
filter_cols(hl.literal(subjs).contains(genotype_info.s))
Hail syntax for this is:
filter_cols(hl.literal(subjs).contains(genotype_info.s))
Hi there. I’ve read in all of the bgen files and subset them using the above command. Now I just want to count the number of samples in the subset to make sure that I have all of the samples.
However, counting the subset matrix table is taking a really long time.
genotype_info = hl.import_bgen(bgen22Files, entry_fields=['dosage'], sample_file=sampleFile)
genotype_info_subset = genotype_info.filter_cols(hl.literal(subjs).contains(genotype_info.s))
This works very quickly:
genotype_info.count()
This is taking a really really long time:
genotype_info_subset.count()
Should I be overwriting genotype_info
? Maybe Hail did some indexing when I created the matrix table and it needs to reindex after I subset?
I’m reading in my idpsFile and I’m getting:
2021-04-08 11:37:31 Hail: INFO: Ordering unsorted dataset with network shuffle
What does this mean? The idpsFile is a 11,533 x 79,800 csv of floats. Here is an example entry: -0.820223875951066
It is essential that I keep the same ordering of this matrix. The rows and columns mean something very specific (rows are subjects, columns are brain functional connectivity edges). Please let me know if there is a better way to do this. Here is how I was thinking of setting this up:
genotype_info
by s
. The rows in idpsFile
correspond to the order of the sample text file (how would I “tag” the idpsFile matrix table so that the rows are associated with the sample IDs)?This is probably a Hail performance bug wherein it doesn’t realize it need not look at the genotype data. genotype_info_subset.count_cols()
should be fast.
Does just the import-and-write operation cause this error? Concretely, does running only this show that message? If it does, that’s a Hail bug. Can you share a simplified file that triggers this?
import hail as hl
pheno_mt = hl.import_matrix_table(idpsFile, no_header=True)
pheno_mt.write('some_path.mt')
Thanks for your response.
Yes, count_cols()
is much faster. Thank you.
Yes, just reading and writing causes the error. However, I also have to specify the dtype. If I do not I get:
Error summary: MatrixParseError: file:/data/Ukbb_DSST/DSST/dmoracze/gwas/head_idps.csv:2-3, invalid character '.' in integer literal
If I try:
pheno_mt = hl.import_matrix_table(idpsFile, no_header=True,entry_type=hl.tstr)
I get:
but it doesn’t error out. However the data are not strings, they are float (eg -0.820223875951066
), but it I try entry_type=hl.tfloat64
I get a really long error beginning with:
FatalError: NumberFormatException: For input string:
In bash, I used head
to grab to top of the file to see if it would recreate the error. It doesn’t but it gives me:
2021-04-08 14:40:52 Hail: INFO: Coerced sorted dataset
The full file is 16GB in size. Do you think if I rounded those floats Hail would be happier?
I have attached the first 10 rows and 10 columns from the full idpsFile
here.
samp.txt (1.8 KB)
This command should not trigger that message because it is not an action (write, show, collect, etc.). Some subsequent operation is causing that message. What is the full series of commands that triggers that error? What version of Hail are you using? Latest is currently 0.2.64.
Can you try this instead?
import hail as hl
mt = hl.import_matrix_table('samp.txt',
no_header=True,
entry_type=hl.tfloat,
delimiter=',')
mt.show()
If you get the number format error again, paste the whole thing here or at some paste service (like gist.github.com) and share the link here and I can dig in further.
EDIT: Do you have missing values? How are they represented? hl.import_matrix_table
uses the string NA
by default, but you can override that.
I’m using Hail 0.2.45.
Sorry, I wasn’t clear in the above message and it appears that I get the warning when I write the matrix table, not read it.
import hail as hl
mt = hl.import_matrix_table('samp.txt',
no_header=True,
entry_type=hl.tfloat,
delimiter=',')
This works, no errors or warnings, but mt.show()
gives me a whole host of errors. This is the bottom, I can paste the hundreds of lines in the error if you want.
Hail version: 0.2.45-a45a43f21e83
Error summary: SparkException: Failed to get broadcast_5_piece0 of broadcast_5
When I write the matrix table, I get the warning:
mt.write('samp.mt')
2021-04-08 15:29:41 Hail: INFO: Coerced sorted dataset
2021-04-08 15:29:41 Hail: INFO: wrote matrix table with 10 rows and 10 columns in 1 partition to samp.mt
You should update to hail 0.2.64. You’re missing out on 9 extra months of development, it’s possible things have improved and some of your problems have been addressed.
Oof! That’s almost a year old. Would you mind updating to the latest version of Hail and trying again?
The broadcast issue is perplexing, I’d need the complete stack trace to address that. It looks likely to be some spark configuration issue.
EDIT: it looks like the write was successful anyway.
Thanks. I need to put this project down for today. I’ll update and try tomorrow.
Yes, I just want to make sure that the order in which it was written was the order it was read because the row order is tied to the samples. So I wanted to know whether are ordering or sorting was happening.
I’m quite confident this snippet changes neither the row nor column ordering:
import hail as hl
mt = hl.import_matrix_table('samp.txt',
no_header=True,
entry_type=hl.tfloat,
delimiter=',')
mt.write('samp.mt')
If it does, that’d be a serious error in Hail that I’d like to investigate further.
Hello - I’m trying to incorporate the method outlined by @tpoterba above:
However saving the subset genotype data as a matrix table is taking a really long time.
genotype_info = hl.import_bgen(bgen22Files, entry_fields=['dosage'], sample_file=sampleFile)
genotype_info_subset = genotype_info.filter_cols(hl.literal(subjs).contains(genotype_info.s))
genotype_info_subset.write('geno_subset.mt')
bgen22Files
is a list of all the bgen files for the 22 chromosomes. I’m subsetting by subjs
which is a list of 11,533 sample IDs. The genotype_info_subset.write('geno_subset.mt')
call has been running for almost 24 hours.
Is it necessary to save the subset genotype data as a matrix table? Or should I just merge the pheno and the geno data and move on, not saving it?
Another question that I’m going to have: how should I optimize the gwas to run on my HPC? I’ve got 79,800 gwases to run. Should I manually distribute these jobs using our job scheduler or is @tpoterba’s suggestion of working in batches the most efficient way to go about this?
The UKB BGEN files are very large and selecting out 11k of 500k columns is a time-consuming operation. It is not necessary to save the genotype data as a matrix table. BGEN an efficient file format, I recommend using it directly.
I don’t think there’s any optimization that you can do. Hail could be faster (we’re working on it!), but Hail was primarily designed to scale, on a Spark cluster, to many cores. That’s how most folks use Hail on the UKB: they use a very large number of cores for a short period of time to quickly finish an analysis. Because Hail scales well, the cost is roughly the same whether you use 1 core for 10,000 hours or 10,000 cores for 1 hour.
EDIT: 79,800 GWASes on every variant in the BGEN for 11,000 is a tremendous amount of computation. The Neale Lab ran ~2k GWASes, on every variant, on 337,000 individuals, and it took a couple days with thousands of cores. The analysis you want to do is certainly possible, but I think only in an environment where you can lease thousands of cores for a couple of days.
Thanks for the info! I’m not worried too much about compute time, I just want to make sure I’m preparing in the most efficient way as possible. I’ll be working on the Biowulf cluster at the NIH.
EDIT: Once I know the most efficient way to use Hail, then I’ll benchmark the jobs and I can work with the cluster support staff to efficiently request resources.
I’m having an issue in my process of subsetting and merging the genotype and phenotype data.
# phenotype
# 11533 x 79800 csv
pheno_mt = hl.import_matrix_table(idpsFile, no_header=True, entry_type=hl.tfloat, delimiter=',')
subjs # list of 11533 strings of subject IDs (samples)
# genotype
genotype_info = hl.import_bgen(bgen22Files, entry_fields=['dosage'], sample_file=sampleFile)
genotype_info_subset = genotype_info.filter_cols(hl.literal(subjs).contains(genotype_info.s))
When I subset genotype_info
based on subjs
, only 11224 samples are returned. I take it this means I don’t have genotype data for some of the subjects I have phenotype data for. That’s fine.
However, I want to merge these by sample / s
/ subjs
. Since subjs
and pheno_mt
are separate I’m not sure how to perform this merge. Ideally I’d have the row_key for pheno_mt
be subjs
.
I’ve tried to use genotype_info_subset.s
to subset subj
subjs[genotype_info_subset.s]
TypeError: list indices must be integers or slices, not StringExpression
I’ve tried to annotate columns of pheno_mt
with subjs
but this puts the entire subject list at each row. And I’ve tried to replace the row key of pheno_mt
with pheno_mt.key_cols_by(subjs)
:
TypeError: key_cols_by: parameter '*keys' (arg 0 of 1): expected (str or hail.expr.expressions.base_expression.Expression), found list:
@danking since I’m wanting to run 79,800 separate gwases, distributed across our cluster. Which approach do you think is more effeicient?
or
Either way I’ll need a way to preserve the pheno column index, so I’ll likely write a script that takes a column index as an argument.
I think you want option 3 –
Ah good call, thanks.
You suggested this earlier in the thread, sorry to make you repeat yourself!
@tpoterba your method seems to be working well, though saving the results takes a really really long time. Is that to be expected?
batch_results = hl.linear_regression_rows(
[geno_subset_mt.phenos[i] for i in range(batch_size)],
geno_subset_mt.dosage,
covariates=[1.0],
block_size=128)
batch_results.write('test') # takes a really long time
Yes, absolutely! Hail’s interfaces are lazy; no computational work is happening until your write
line. The time for the write
includes the time for the regressions (which is the majority of the work).