I am trying to get kinship information for 590 WGS samples. The number of variants is over 38M.
kinship = hl.king(mt.GT)
With the script above, I got the error message below.
FatalError: HailException: Cannot create BlockMatrix: filtered entry at row 8491008 and col 406
Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2073 in stage 14.0 failed 20 times, most recent failure: Lost task 2073.19 in stage 14.0 (TID 426052, js-c19-sw-mv7r.c.gbsc-gcp-project-mvp.internal, executor 1570): is.hail.utils.HailException: Cannot create BlockMatrix: filtered entry at row 8491008 and col 406
Please let me know how to interpret this error message. Thank you.
The attached log file includes the full error message.
king_20210302.log (13.7 KB)
This means your matrix table has filtered entries.
hl.king currently doesn’t handle filtered entries. I’ll fix that. In the meantime, just add
mt = mt.unfilter_entries() before you call hl.king.
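A minimal sketch of that workaround, assuming mt is the matrix table from the question and has a GT entry field. The helper name kinship_with_unfiltered_entries is hypothetical; unfilter_entries() and hl.king are the only Hail calls used. Hail is imported inside the function, so the helper can be defined before a Hail session exists.

```python
def kinship_with_unfiltered_entries(mt):
    """Run KING on `mt` after restoring its filtered entries.

    `mt` is assumed to be a Hail MatrixTable with a `GT` call field.
    """
    import hail as hl  # lazy import: needs a working Hail installation at call time

    # Filtered entries come back as missing values, so hl.king can
    # build its block matrix without hitting a "filtered entry" error.
    mt = mt.unfilter_entries()

    # Same call as in the question.
    return hl.king(mt.GT)
```

Usage would be kinship = kinship_with_unfiltered_entries(mt), in place of the direct hl.king(mt.GT) call.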
Thank you so much for your efforts. I will retry it.
After adding mt.unfilter_entries(), the error message was gone. Thanks again.
By the way, this function has now been running continuously for more than 16 hours. The job runs on Dataproc on GCP with autoscaling, which can use up to 4000 cores.
Is this runtime normal for king()? I would like to know whether I should keep waiting or stop the job and look for a problem.
Are you using all 38 million variants? You probably do not need all 38 million variants. Most folks that I know use about five to ten thousand common variants. Each rare variant contributes very little information to relatedness.
KING, like most relatedness methods, scales like N_SAMPLES^2 * N_VARIANTS, so using all 38 million variants is very time consuming.
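To put rough numbers on that scaling, here is a small arithmetic sketch. It assumes the cost is simply proportional to N_SAMPLES^2 * N_VARIANTS, ignores constant factors and cluster overhead, and uses 10,000 variants as an illustrative subset size. The function name relative_king_cost is hypothetical.

```python
def relative_king_cost(n_samples, n_variants):
    """Cost proxy proportional to N_SAMPLES^2 * N_VARIANTS (constants ignored)."""
    return n_samples ** 2 * n_variants


# Numbers from this thread: 590 samples, about 38 million variants.
full = relative_king_cost(590, 38_000_000)

# Illustrative subset of 10,000 common variants.
subset = relative_king_cost(590, 10_000)

# Ratio of the two cost proxies; the sample count cancels out.
print(full / subset)  # 3800.0
```

Under that assumption, the full variant set costs about 3800 times as much as a 10,000-variant subset. This is only a ratio of cost proxies, not a wall-clock prediction.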
Got it. Do you have a specific way to select essential common variants for the kinship analysis in Hail? Thank you for supporting this work continuously.
That’s a biology question and I’m not really a biologist. Using only those variants with at least 5% or at least 1% minor allele frequency seems reasonable to me.
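One way to apply such a threshold in Hail might look like the sketch below. It assumes mt holds biallelic sites, so the smallest entry of the AF array computed by hl.variant_qc is the minor allele frequency. The helper name keep_common_variants and its min_maf parameter are hypothetical.

```python
def keep_common_variants(mt, min_maf=0.05):
    """Keep only rows whose minor allele frequency is at least `min_maf`.

    Assumes biallelic sites; at multiallelic sites min(AF) is the
    frequency of the rarest allele, which is not the same quantity.
    """
    import hail as hl  # lazy import: needs a working Hail installation at call time

    # variant_qc adds a `variant_qc` row field; its `AF` array holds one
    # frequency per allele, including the reference allele.
    mt = hl.variant_qc(mt)

    # For a biallelic site the smaller of the two frequencies is the MAF.
    maf = hl.min(mt.variant_qc.AF)

    return mt.filter_rows(maf >= min_maf)
```

For example, common = keep_common_variants(mt, min_maf=0.01) followed by hl.king(common.GT). If that still leaves far more variants than needed, mt.sample_rows(p) can thin the rows further at random.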
I am not a biologist either. Your comment is enough to point me in the right direction. Thanks a lot!!