Hello,
I have been testing Hail on an AWS Spark cluster, using an S3 bucket to store MatrixTables. I uploaded a test dataset from a local machine to S3 and can read it fine with read_matrix_table. With this test dataset, I run sample_qc, run a PCA, remove related samples, etc.
If I keep the modified MatrixTable in memory and then train the gnomad module's variant recalibration random forest model, it runs with no problems. However, if I write the modified MatrixTable to S3 and read it back with read_matrix_table, the random forest model training fails with an AssertionError. Moreover, if I try to show() entry fields (GT, for example), I get the same error.
Any help is appreciated. Thanks!
assertionerror.txt (35.4 KB)
The error suggests this is a bug in Hail. We recently changed some defaults so that we’d run a faster, but less battle-tested codepath, and it’s possible that doing so exposed this.
As a way to check that, can you try putting:
hl._set_flags(no_whole_stage_codegen='1')
at the top of your script? If that resolves your issue, it narrows things down for us a bit w.r.t. determining the source of the problem.
Tried the flag, but it runs into another error:
import hail as hl
hl._set_flags(no_whole_stage_codegen='1')
error_flag.txt (7.0 KB)
What does your script start with? Do you call hl.init explicitly?
Yes, this is the script preamble at the moment:
import hail as hl
from bokeh.embed import file_html
from bokeh.resources import CDN
from gnomad.utils.filtering import filter_to_autosomes
from gnomad.utils.annotations import add_variant_type
from gnomad.utils.annotations import bi_allelic_site_inbreeding_expr
from gnomad.variant_qc.random_forest import apply_rf_model, median_impute_features
from gnomad.variant_qc.pipeline import train_rf_model
DEFAULT_REF = "GRCh38"
hl.init(
    sc,
    idempotent=True,
    quiet=True,
    skip_logging_configuration=True,
    default_reference=DEFAULT_REF,
)
Try setting the flag after your call to init, I think.
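For concreteness, a sketch of what that reordering might look like, based on the preamble posted above (here `sc` is assumed to be the SparkContext already available in your environment, and the flag name is the one from earlier in this thread):

```python
import hail as hl

DEFAULT_REF = "GRCh38"

# Initialize Hail first...
hl.init(
    sc,
    idempotent=True,
    quiet=True,
    skip_logging_configuration=True,
    default_reference=DEFAULT_REF,
)

# ...then set the flag, so init does not reset or ignore it.
hl._set_flags(no_whole_stage_codegen='1')
```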
Nice! The flag seems to have solved the read problems, but I got another error during random forest model application. I don't know if you can help, since it seems to happen during execution of gnomad's apply_rf_model() function, but any input is appreciated. I will mark this as the solution. If you think I should open a new thread, please tell me.
Thanks!
TaskCompletionListenerException.txt (20.1 KB)
# %% Training the model
rf_trained_ht, rf_model = train_rf_model(
    rf_ht,
    rf_features=features,
    tp_expr=rf_ht.tp,
    fp_expr=rf_ht.fp,
    test_expr=hl.literal(test_intervals).any(
        lambda interval: interval.contains(rf_ht.locus)
    ),
)

# %% Joining results
ht = rf_ht.join(rf_trained_ht, how="left")

# --> Failing here <--
rf_results = apply_rf_model(
    ht=ht,
    rf_model=rf_model,
    features=hl.eval(rf_trained_ht.features),
    label="rf_label",
    prediction_col_name="rf_prediction",
)
I think moving that question to a new thread would be for the best, yeah.
Separately, I'd still like to fix the problem you were running into initially. Setting the flag is a temporary workaround that will eventually go away; the real problem still needs to be fixed. Could you share the script you were running? You mentioned a "test dataset"; is that a public one? If I could reproduce your error, things would be easier.