Error running hl.king() for kinship analysis

Hello,

I am trying to get kinship information for 590 WGS samples. The number of variants is over 38M.

kinship = hl.king(mt.GT)

With the script above, I got the error message below.

FatalError: HailException: Cannot create BlockMatrix: filtered entry at row 8491008 and col 406

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2073 in stage 14.0 failed 20 times, most recent failure: Lost task 2073.19 in stage 14.0 (TID 426052, js-c19-sw-mv7r.c.gbsc-gcp-project-mvp.internal, executor 1570): is.hail.utils.HailException: Cannot create BlockMatrix: filtered entry at row 8491008 and col 406

Please let me know how to interpret this error message. Thank you.

Best.
Jina

This log file includes a full error message.
king_20210302.log (13.7 KB)

Hi @jinasong,

This means your matrix table has filtered entries. hl.king currently doesn’t handle filtered entries. I’ll fix that. In the meantime, just add mt = mt.unfilter_entries() before you call hl.king.
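As a minimal sketch of that workaround, assuming `mt` is a MatrixTable with a `GT` call entry field as in the original script (the helper name `king_with_unfiltered_entries` is only illustrative, not part of Hail):

```python
def king_with_unfiltered_entries(mt):
    """Run KING after turning filtered entries back into missing entries.

    Assumes `mt` is a Hail MatrixTable with a `GT` entry field.
    """
    # Imported inside the function so the helper can be defined
    # without Hail having been initialized yet.
    import hail as hl

    # unfilter_entries() restores filtered entries as entries whose
    # fields are missing, so the block matrix can be built.
    mt = mt.unfilter_entries()
    return hl.king(mt.GT)
```

Calling `kinship = king_with_unfiltered_entries(mt)` would then replace the original `kinship = hl.king(mt.GT)` line.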

This PR should fix that: [query] teach king to treat filtered entries as missing by danking · Pull Request #10134 · hail-is/hail · GitHub

Hi @danking,

Thank you so much for your efforts. I will retry it.

-Jina

Hi @danking

After adding mt.unfilter_entries(), the error message was gone. Thanks again.
By the way, this function has been running continuously for more than 16 hours. This work is on Dataproc on GCP with autoscaling which can use up to 4000 cores.

Is this a normal run time for king()? I just want to know whether I should keep waiting or stop the job and look for a problem.

Thank you.
Jina

Are you using all 38 million variants? You probably do not need all 38 million variants. Most folks that I know use about five to ten thousand common variants. Each rare variant contributes very little information to relatedness.

KING, like most relatedness methods, scales like N_SAMPLES^2 * N_VARIANTS, so using all 38 million variants is very time consuming.
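To make that scaling concrete, here is a rough back-of-the-envelope sketch. It only evaluates the proportional cost model N_SAMPLES^2 * N_VARIANTS for the numbers in this thread; constant factors and cluster effects are ignored, and the 10,000-variant subset is just the upper end of the range mentioned above.

```python
def king_cost(n_samples: int, n_variants: int) -> int:
    """Relative work for a method scaling as n_samples^2 * n_variants."""
    return n_samples ** 2 * n_variants


n_samples = 590
all_variants = king_cost(n_samples, 38_000_000)  # every variant in the dataset
subset = king_cost(n_samples, 10_000)            # a small set of common variants

# The sample count cancels, so the ratio is just 38,000,000 / 10,000.
speedup = all_variants / subset
print(f"all variants: {all_variants:.3e} units of work")
print(f"subset:       {subset:.3e} units of work")
print(f"ratio:        {speedup:.0f}x")
```

Under this model, the full 38 million variants cost about 3,800 times as much work as a 10,000-variant subset for the same 590 samples.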

Hi @danking

Got it. Do you have a specific way to select essential common variants for the kinship analysis in Hail? Thank you for supporting this work continuously.

Best,
Jina

That’s a biology question and I’m not really a biologist. Using only those variants with at least 5% or at least 1% minor allele frequency seems reasonable to me.
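One possible way to apply such a threshold in Hail is sketched below. It is not a recommendation on which variants to choose. It assumes the matrix table has a `GT` entry field, restricts to biallelic variants so that the smaller of the two allele frequencies is the minor allele frequency, and uses the 5% cutoff mentioned above as a default. The function name `select_common_variants` is made up for illustration.

```python
def select_common_variants(mt, min_maf=0.05):
    """Keep biallelic variants whose minor allele frequency is >= min_maf.

    Assumes `mt` is a Hail MatrixTable with a `GT` entry field.
    """
    # Imported inside the function so the helper can be defined
    # without Hail having been initialized yet.
    import hail as hl

    # Assumption: restrict to biallelic sites so that min(AF) is the MAF.
    mt = mt.filter_rows(hl.len(mt.alleles) == 2)

    # variant_qc adds a row field `variant_qc` containing `AF`,
    # an array with one frequency per allele (reference first).
    mt = hl.variant_qc(mt)
    maf = hl.min(mt.variant_qc.AF)

    return mt.filter_rows(maf >= min_maf)
```

The filtered table could then be passed on as before, for example `hl.king(select_common_variants(mt).GT)`, with `min_maf=0.01` for the 1% cutoff.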

I am not a biologist either. Your comment is helpful enough to point me in the right direction. Thanks a lot!!

-Jina