I have been testing Hail on an AWS Spark cluster, using an S3 bucket to store MatrixTables. I uploaded a test dataset from a local machine to S3, and I can read it fine with read_matrix_table. With this test dataset, I run sample_qc, run a PCA, remove related samples, etc.
If I keep the modified MatrixTable in memory and try to train the gnomad module’s variant recalibration random forest model, it runs with no problems. However, if I write the modified MatrixTable to S3 and read it back with read_matrix_table, the random forest model training fails with an AssertionError. Moreover, if I try to show() entry fields (GT, for example) on the reloaded MatrixTable, I get the same error.
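For reference, here is a rough sketch of the write/read round trip described above. The helper name `write_and_reload` and the bucket path in the comments are placeholders made up for this post, not my real code or bucket; the only Hail calls assumed are `MatrixTable.write` and `hl.read_matrix_table`.

```python
def write_and_reload(mt, path, read_fn=None):
    """Write a MatrixTable to `path`, then read it back from storage.

    `write_and_reload` is a hypothetical helper for illustration only.
    `read_fn` defaults to hl.read_matrix_table; it is a parameter so the
    helper can be exercised without a Hail installation.
    """
    if read_fn is None:
        import hail as hl  # imported lazily, only when actually needed
        read_fn = hl.read_matrix_table
    # Persist the (already QC'd / filtered) MatrixTable.
    mt.write(path, overwrite=True)
    # Return a fresh MatrixTable backed by the files just written.
    return read_fn(path)


# Sketch of usage on the cluster (placeholder path, not a real bucket):
#
#   import hail as hl
#   mt = hl.read_matrix_table("s3://my-bucket/test.mt")
#   mt = hl.sample_qc(mt)
#   ...  # PCA, removal of related samples, etc.
#   mt2 = write_and_reload(mt, "s3://my-bucket/test_modified.mt")
#   mt2.GT.show()  # this is the step where I see the AssertionError
```

Working from `mt` directly is fine; the error only shows up when working from the reloaded `mt2`.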
Any help is appreciated. Thanks!
assertionerror.txt (35.4 KB)