AssertionError when trying to operate with MatrixTable read from S3

Hello,

I have been testing Hail on an AWS Spark cluster, using an S3 bucket to store MatrixTables. I uploaded a test dataset from a local machine into S3, and I can read it fine with read_matrix_table. With this test dataset, I run sample_qc, run a PCA, remove related samples, and so on.

If I keep the modified MatrixTable in memory and train the gnomad module’s variant recalibration random forest model, it runs with no problems. However, if I write the modified MatrixTable to S3 and read it back with read_matrix_table, the random forest model training fails with an AssertionError. Moreover, if I try to show() entry fields (GT, for example), I get the same error.
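
Roughly, the failing pattern looks like this (the bucket and paths below are placeholders, not my real ones):

import hail as hl

# reading the dataset uploaded from my local machine works fine
mt = hl.read_matrix_table("s3://my-bucket/test_dataset.mt")
mt = hl.sample_qc(mt)
# ... PCA, relatedness filtering, etc.

# after a write/read round trip through S3 the trouble starts
mt.write("s3://my-bucket/test_dataset_qc.mt", overwrite=True)
mt2 = hl.read_matrix_table("s3://my-bucket/test_dataset_qc.mt")
mt2.GT.show()  # AssertionError here, and also during RF model training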

Any help is appreciated. Thanks!

assertionerror.txt (35.4 KB)

The error suggests this is a bug in Hail. We recently changed some defaults so that we’d run a faster, but less battle-tested codepath, and it’s possible that doing so exposed this.

As a way to check that, can you try putting:

hl._set_flags(no_whole_stage_codegen='1')

at the top of your script? If that resolves your issue, it narrows down the likely source of the problem for us.

I tried the flag, but it runs into another error:

import hail as hl
hl._set_flags(no_whole_stage_codegen='1')

error_flag.txt (7.0 KB)

What does your script start with? Do you call hl.init explicitly?

Yes, this is the script preamble at the moment:

import hail as hl

from bokeh.embed import file_html
from bokeh.resources import CDN

from gnomad.utils.filtering import filter_to_autosomes
from gnomad.utils.annotations import add_variant_type

from gnomad.utils.annotations import bi_allelic_site_inbreeding_expr
from gnomad.variant_qc.random_forest import apply_rf_model, median_impute_features
from gnomad.variant_qc.pipeline import train_rf_model

DEFAULT_REF = "GRCh38"

hl.init(
    sc,
    idempotent=True,
    quiet=True,
    skip_logging_configuration=True,
    default_reference=DEFAULT_REF
)

Try setting the flag after your call to hl.init, I think.
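
Something like this, reusing your preamble (untested sketch):

import hail as hl

DEFAULT_REF = "GRCh38"

hl.init(
    sc,
    idempotent=True,
    quiet=True,
    skip_logging_configuration=True,
    default_reference=DEFAULT_REF
)

# flag set after hl.init rather than at the top of the script
hl._set_flags(no_whole_stage_codegen='1')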

Nice! The flag seems to have solved the read problems, but I got another error during random forest model training. I don’t know if you can help me, since it seems to happen during the execution of the gnomad function apply_rf_model(), but any input is appreciated. I will mark this as the solution. If you think I should open a new thread, please tell me.

Thanks!

TaskCompletionListenerException.txt (20.1 KB)

# %% Training the model
rf_trained_ht, rf_model = train_rf_model(
    rf_ht,
    rf_features=features,
    tp_expr=rf_ht.tp,
    fp_expr=rf_ht.fp,
    test_expr=hl.literal(test_intervals).any(
        lambda interval: interval.contains(rf_ht.locus)
    ),
)

# %% Joining results
ht = rf_ht.join(rf_trained_ht, how="left")

# --> Failing here <--
rf_results = apply_rf_model(
    ht=ht,
    rf_model=rf_model,
    features=hl.eval(rf_trained_ht.features),
    label="rf_label",
    prediction_col_name="rf_prediction",
)

I think moving that question to a new thread would be for the best, yeah.

Separately, I’d still like to fix the problem you were running into initially. Setting the flag is a temporary workaround that will eventually go away; the real problem still needs to be fixed. Could you share the script you were running? You mentioned a “test dataset”; is that a public one? If I could reproduce your error, things would be easier.