Ld_prune() returns SparkException

Mathias_Hansen · November 14, 2018, 11:03am

I am running the function on a small SNP subset with approximately 18,000 samples. I have also tried running the function with the default settings for bp_window_size and memory_per_core, with the same result. I use 6 nodes, each with 16 cores and 180 GB RAM.

KR

Mathias_Hansen · November 14, 2018, 1:05pm

The same error occurs when I run pc_relate().

Mathias_Hansen · November 16, 2018, 8:02am

bump?

tpoterba · November 16, 2018, 3:04pm

@jbloom @danking any ideas?

Seems to be related to block matrix (and only 71 variants!)

danking · November 16, 2018, 4:34pm

We really need to change the default block_size. 4096 is far too large in my experience. @Mathias_Hansen, for both pc_relate and ld_prune please try a block size of 1024.

It is unacceptable that Spark blows up on this tiny block size, but I have not had the time to understand why.
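For anyone following along, here is a rough sketch of what the block size suggestion looks like in code, plus the back-of-the-envelope arithmetic behind it. It assumes each block is a dense square of float64 values, so the figures are a lower bound on what Spark actually holds per task. The helper names, the `mt` variable, and the r2, bp_window_size, min_individual_maf and k values are illustrative assumptions, not settings taken from this thread.

```python
import math

BYTES_PER_FLOAT64 = 8


def dense_block_mib(block_size):
    """Size in MiB of one dense block_size x block_size block of float64 values."""
    return block_size * block_size * BYTES_PER_FLOAT64 / 2**20


def n_blocks(n_rows, n_cols, block_size):
    """Number of blocks needed to tile an n_rows x n_cols matrix."""
    return math.ceil(n_rows / block_size) * math.ceil(n_cols / block_size)


def prune_and_relate(mt, block_size=1024):
    """Run ld_prune and pc_relate with an explicit block size.

    Assumes `mt` is a MatrixTable with a biallelic GT entry field.
    The numeric arguments other than block_size are placeholders.
    """
    import hail as hl  # imported lazily so the arithmetic above runs without Hail

    pruned = hl.ld_prune(mt.GT, r2=0.2, bp_window_size=500000,
                         block_size=block_size)
    mt = mt.filter_rows(hl.is_defined(pruned[mt.row_key]))
    rel = hl.pc_relate(mt.GT, 0.01, k=10, statistics='kin',
                       block_size=block_size)
    return mt, rel


print(dense_block_mib(4096))  # 128.0
print(dense_block_mib(1024))  # 8.0
```

Under that assumption, a single block at the default size of 4096 is 128 MiB, while a block at 1024 is 8 MiB, which is why the smaller value is worth trying first.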

Mathias_Hansen · November 19, 2018, 10:22am

I first tried reducing block_size to 1024 as you suggested, but the error message did not change. I then increased the subset of data (to approximately a whole chromosome), and now I get a different exception from both ld_prune() and pc_relate():
“inbreeding does not support multiallelic variants/genotypes. Found genotype 0/2.”
I call split_multi() on the dataset first, as suggested by your guide. I assume this could create 0/2 genotypes, so I don’t understand the error.
I have added filter_rows(len(data.alleles) == 2) after the split_multi() call to be sure, but this returned the same exception.
KR

tpoterba · November 19, 2018, 12:03pm

How are you using split_multi?

See the difference between these two: split_multi() and split_multi_hts().

Mathias_Hansen · November 19, 2018, 1:30pm

I used split_multi() without any non-default arguments. Would you suggest split_multi_hts() instead? I don't quite understand why it should be used. I tried running split_multi_hts(), but then ld_prune() removes all variants and returns a HailException: block matrix must have at least one row.

tpoterba · November 19, 2018, 11:10pm

split_multi does not change the genotype at all. split_multi_hts does.
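To make that distinction concrete, here is a minimal sketch. The pure-Python `downcode` function is a stand-in written for this illustration, not Hail code; it mimics what `hl.downcode` does to an unphased call, which is the kind of genotype update split_multi_hts applies and split_multi leaves to the user. The `split_and_downcode` helper is a hypothetical name and assumes the entry field is called GT.

```python
def downcode(gt, a_index):
    """Recode an unphased call for one split allele.

    `gt` is a tuple of allele indices, e.g. (0, 2) for 0/2.
    Alleles equal to `a_index` become 1; every other allele becomes 0.
    """
    return tuple(sorted(1 if allele == a_index else 0 for allele in gt))


def split_and_downcode(mt):
    """Split with split_multi, then update GT by hand.

    Assumes `mt` is a MatrixTable with a GT entry field. Other entry
    fields (AD, PL, ...) are left untouched in this sketch.
    """
    import hail as hl  # imported lazily so the stand-in above runs without Hail

    sm = hl.split_multi(mt)
    return sm.annotate_entries(GT=hl.downcode(sm.GT, sm.a_index))


# A 0/2 call at a site with two alternate alleles:
print(downcode((0, 2), 1))  # (0, 0)
print(downcode((0, 2), 2))  # (0, 1)
```

This is also why filtering on len(alleles) == 2 after split_multi would not be expected to help: every split row already has two alleles, but the calls in those rows can still carry allele index 2.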

If ld prune is removing all variants after split_multi_hts, that’s something to look into.

What variant caller does your dataset come from?

Mathias_Hansen · November 20, 2018, 11:32am

OK, then I just find your manual misleading here:
https://hail.is/docs/0.2/methods/genetics.html?highlight=ld_prune#hail.methods.ld_prune
bcftools version 1.3 and htslib 1.3 were used for variant calling.

tpoterba · November 20, 2018, 11:41am

which part of the LD prune docs is confusing? I want to improve them if possible!

Mathias_Hansen · November 20, 2018, 12:02pm

Particularly this part seems (to me) to indicate that the “regular” split_multi() works fine with ld_prune():
Note: Requires the dataset to contain no multiallelic variants. Use split_multi() or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Mathias_Hansen · November 26, 2018, 1:52pm

I would still like help regarding the potential variant caller issue.

tpoterba · November 26, 2018, 2:50pm

Do you still get this error with split_multi_hts?

Mathias_Hansen · December 6, 2018, 1:03pm

I still have the problem that use of split_multi_hts() will result in ld_prune() removing all variants, yes.

tpoterba · December 6, 2018, 1:47pm

what do your genotypes look like after split_multi_hts?

if you export the variant qc table do things look sensible?
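One possible way to run both checks is sketched below. The function name, the `mt` variable and the output path are assumptions for illustration; the sketch also assumes the entry field is called GT.

```python
def inspect_split_genotypes(mt, qc_path):
    """Summarise genotypes after split_multi_hts and export variant QC.

    Assumes `mt` is an unsplit MatrixTable with a GT entry field and
    `qc_path` is a writable location for a TSV file.
    """
    import hail as hl  # imported lazily; requires a working Hail installation

    split = hl.split_multi_hts(mt)

    # Tally calls by their number of non-reference alleles.
    alt_counts = split.aggregate_entries(
        hl.agg.counter(split.GT.n_alt_alleles()))

    # Add per-variant QC metrics as a row field and write them out,
    # one column per metric.
    split = hl.variant_qc(split)
    split.rows().select('variant_qc').flatten().export(qc_path)

    return alt_counts
```

If the returned tally shows almost no non-reference calls, or the exported allele frequencies look implausible, that would point at the input data rather than at ld_prune itself.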

Mathias_Hansen · December 11, 2018, 12:06pm

The problem was solved on the server side.
