ld_prune() raises SparkException

I am running the function on a small SNP subset with approximately 18,000 samples. I have also tried running it with the default settings for bp_window_size and memory_per_core, with the same result. I use 6 nodes, each with 16 cores and 180 GB of RAM.


The same error occurs when I run pc_relate().


@jbloom @danking any ideas?

Seems to be related to block matrix (and only 71 variants!)

We really need to change the default block_size. 4096 is far too large in my experience. @Mathias_Hansen, for both pc_relate and ld_prune please try a block size of 1024.
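
A minimal sketch of passing a smaller block size to both methods. It assumes `mt` is a MatrixTable with a `GT` call field and only biallelic variants. The function name, the `r2` threshold, the minor allele frequency cutoff and `k` are illustrative choices, not values from this thread.

```python
def prune_and_relate(mt, block_size=1024):
    """Run ld_prune and pc_relate with an explicit, smaller block size.

    `mt` is assumed to be a biallelic Hail MatrixTable with a GT field.
    """
    import hail as hl  # imported inside the function so the sketch loads without Hail

    # ld_prune returns a table of the variants to keep, keyed like the rows of mt.
    pruned = hl.ld_prune(mt.GT, r2=0.2, block_size=block_size)
    mt_pruned = mt.filter_rows(hl.is_defined(pruned[mt.row_key]))

    # pc_relate accepts the same block_size keyword; 0.01 is min_individual_maf.
    relatedness = hl.pc_relate(mt_pruned.GT, 0.01, k=10, block_size=block_size)
    return mt_pruned, relatedness
```

Calling `prune_and_relate(mt)` uses a block size of 1024; pass `block_size=512` or similar to go lower if the Spark error persists.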

It is unacceptable that Spark blows up on this tiny block size, but I have not had the time to understand why.

I first reduced the block_size to 1024 as you suggested, which did not change the error message. I then increased the subset of data (to approximately a whole chromosome), and now I get a different exception from both ld_prune() and pc_relate():
“inbreeding does not support multiallelic variants/genotypes. Found genotype 0/2.”
I call split_multi() on the dataset first, as suggested by your guide. I assumed this would leave no 0/2 genotypes, so I don’t understand the error.
I have added filter_rows(len(data.alleles) == 2) after the split_multi() call to be sure, but this raised the same exception.

How are you using split_multi?

See the difference between these two:

I used split_multi() without any optional arguments. Are you suggesting split_multi_hts() instead? I don’t fully understand why it should be used. I tried running split_multi_hts(), but then ld_prune() removes all variants and raises a HailException: block matrix must have at least one row.

split_multi does not change the genotype at all. split_multi_hts does.
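
To make the difference concrete, here is a plain-Python sketch (not Hail code) of what happens to one unphased diploid genotype when a site with two alternate alleles is split. The helper names are hypothetical, and the downcoding rule shown (alleles equal to the split allele index become 1, all others become 0) is my understanding of what split_multi_hts applies to GT.

```python
def downcode(genotype, a_index):
    """Recode an unphased genotype for the alternate allele with index a_index.

    Alleles equal to a_index become 1; every other allele becomes 0.
    """
    return tuple(sorted(1 if allele == a_index else 0 for allele in genotype))


def split_keep_genotype(genotype, n_alt):
    """Mimic a split that leaves the genotype untouched (like split_multi)."""
    return [(a_index, genotype) for a_index in range(1, n_alt + 1)]


def split_downcode_genotype(genotype, n_alt):
    """Mimic a split that downcodes the genotype (like split_multi_hts)."""
    return [(a_index, downcode(genotype, a_index)) for a_index in range(1, n_alt + 1)]


# A 0/2 genotype at a site with two alternate alleles:
print(split_keep_genotype((0, 2), 2))      # [(1, (0, 2)), (2, (0, 2))]
print(split_downcode_genotype((0, 2), 2))  # [(1, (0, 0)), (2, (0, 1))]
```

In the first case each split row is biallelic but still carries the 0/2 call, which is what the “Found genotype 0/2” error complains about. Filtering on len(alleles) == 2 does not help, because the rows are already biallelic after the split.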

If ld prune is removing all variants after split_multi_hts, that’s something to look into.

What variant caller does your dataset come from?

OK, then I find your manual misleading on this point.
The variants were called with bcftools 1.3 and htslib 1.3.

Which part of the LD prune docs is confusing? I want to improve them if possible!

Particularly this part seems (to me) to indicate that the “regular” split_multi() works fine with ld_prune():
Note: Requires the dataset to contain no multiallelic variants. Use split_multi() or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

I would still like help regarding the potential variant caller issue.

Do you still get this error with split_multi_hts?

Yes, with split_multi_hts() I still have the problem that ld_prune() removes all variants.

What do your genotypes look like after split_multi_hts?

If you export the variant QC table, do things look sensible?
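
One possible way to check both, sketched under the assumption that `mt` is the MatrixTable returned by split_multi_hts and has a `GT` field. The function name and `qc_path` are placeholders.

```python
def inspect_split_genotypes(mt, qc_path):
    """Count genotype calls and export per-variant QC for a split MatrixTable.

    `mt` is assumed to come from split_multi_hts; `qc_path` is a hypothetical
    output location such as a .tsv file.
    """
    import hail as hl  # imported inside the function so the sketch loads without Hail

    # How often each distinct call (0/0, 0/1, 1/1, ...) occurs across all entries.
    gt_counts = mt.aggregate_entries(hl.agg.counter(mt.GT))
    n_variants = mt.count_rows()

    # variant_qc adds a `variant_qc` row field (call rate, allele frequencies, ...).
    mt = hl.variant_qc(mt)
    mt.rows().select('variant_qc').flatten().export(qc_path)
    return n_variants, gt_counts
```

If the counts show only missing or homozygous-reference calls, that would be consistent with ld_prune() ending up with nothing to keep.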

The problem was solved on the server side.