I am running the function on a small SNP subset with approx. 18,000 samples. I have also tried running the function with the default settings for bp_window_size and memory_per_core, with the same result. I use 6 nodes, each with 16 cores and 180 GB RAM.
We really need to change the default block_size. 4096 is far too large in my experience. @Mathias_Hansen, for both pc_relate and ld_prune please try a block size of 1024.
It is unacceptable that Spark blows up on this tiny block size, but I have not had the time to understand why.
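To put rough numbers on why a smaller block size can matter: Hail's block matrices store dense float64 blocks, so per-block memory grows quadratically with block size. This is back-of-the-envelope arithmetic only; actual Spark task memory includes serialization and shuffle overhead on top.

```python
# Rough per-block memory for a dense double-precision block matrix.
# Illustrative arithmetic only; real Spark overhead comes on top.
def block_bytes(block_size, bytes_per_entry=8):
    return block_size * block_size * bytes_per_entry

print(block_bytes(4096) / 2**20)  # 128.0 MiB per block
print(block_bytes(1024) / 2**20)  # 8.0 MiB per block
```

So dropping block_size from 4096 to 1024 cuts per-block memory by a factor of 16, which is why it is a common first knob to try when Spark tasks run out of memory.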
I first tried reducing block_size to 1024 as you suggested, which did not change the error message. I then increased the subset of data (to approx. a whole chromosome), and now I get a different exception from both ld_prune() and pc_relate():
“inbreeding does not support multiallelic variants/genotypes. Found genotype 0/2.”
I call split_multi() on the dataset first, as suggested by your guide. I assumed splitting would get rid of 0/2 genotypes, so I don’t understand the error.
To be sure, I added filter_rows(hl.len(data.alleles) == 2) after the split_multi() call, but it raised the same exception.
KR
I used split_multi() without any additional arguments. You would suggest split_multi_hts() instead? I don’t quite understand why it should be used in place of split_multi(). When I tried split_multi_hts(), ld_prune() removed all variants and then raised a HailException: block matrix must have at least one row.
In particular, this part of the docs seems (to me) to indicate that the “regular” split_multi() works fine with ld_prune():
Note: Requires the dataset to contain no multiallelic variants. Use split_multi() or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.
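For what it's worth, the likely reason 0/2 survives: split_multi() only splits the variant rows (adding fields like a_index) and leaves entry fields such as GT untouched, whereas split_multi_hts() also downcodes the genotype calls. A plain-Python sketch of that downcoding (illustrative only, not Hail's actual implementation):

```python
def downcode(gt, allele_index):
    # Map every allele other than `allele_index` to ref (0) and
    # `allele_index` itself to alt (1), mirroring what split_multi_hts
    # does to GT at the biallelic row split out for that allele.
    # Illustrative sketch only, not Hail's implementation.
    return tuple(sorted(1 if a == allele_index else 0 for a in gt))

# At the row split out for allele 2, genotype 0/2 becomes 0/1:
print(downcode((0, 2), 2))  # (0, 1)
# At the row for allele 1, the same 0/2 downcodes to hom-ref 0/0:
print(downcode((0, 2), 1))  # (0, 0)
```

So after plain split_multi() the rows are biallelic (which is why the row filter on allele count changes nothing), but the entries can still contain calls like 0/2, which is what the inbreeding computation rejects.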