By “block”, I was referring to LD blocks, i.e. non-overlapping regions of the genome. Experience told us that running a lasso or an elastic net on the entire dataset is not only slow but also not as good as breaking it down into LD blocks. Block sizes are usually within a few thousand SNPs, so each block can still be handled by pandas. I got as far as using make_table and to_spark to export a subset of the data to Spark, but I am not sure how to go about parallelizing the blocks across the genome. Say we have 1000 LD blocks and 2500 partitions. Is it possible to get the executors to process blocks instead of SNPs? Thanks a lot!
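For concreteness, here is roughly the shape of what I was hoping to do once the data are in Spark. This is only a sketch, assuming PySpark 3's applyInPandas, scikit-learn, and made-up names for the block label and per-sample dosage columns:

```python
import pandas as pd
from sklearn.linear_model import ElasticNet

# The step I already have: export to Spark (mt is the MatrixTable; the
# `ld_block` row annotation and the column names below are placeholders).
df = mt.make_table().to_spark()

# Placeholder phenotype, indexed by the per-sample column names in `df`.
phenotype = pd.Series({'sample_1.dosage': 0.3, 'sample_2.dosage': 1.2})

def fit_block(pdf: pd.DataFrame) -> pd.DataFrame:
    # One pandas DataFrame per LD block: rows are SNPs, sample columns are dosages.
    sample_cols = [c for c in pdf.columns if c in phenotype.index]
    X = pdf[sample_cols].to_numpy().T                 # samples x SNPs
    y = phenotype[sample_cols].to_numpy()
    model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
    return pd.DataFrame({'ld_block': [pdf['ld_block'].iloc[0]],
                         'betas': [model.coef_.tolist()]})

results = (df.groupBy('ld_block')
             .applyInPandas(fit_block, schema='ld_block string, betas array<double>')
             .toPandas())
```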
I have a few ideas:
1. We could expose something that writes out a VCF per partition and runs a user-specified tool on the VCF, collecting results. This isn’t my favorite option; it seems hacky.
2. We implement LASSO / elastic net methods on MatrixTable. Then we implement a GroupedMatrixTable that can apply any MT operation to row / col groups. This makes the entire thing basically two lines of code, but we’re a long way from being able to do this.
3. We add ndarrays as a Hail type and implement regularized regression methods on ndarrays. We can then easily add an operation that converts your MatrixTable to a Table with a gene field and a data field (ndarray), where you can do the regression trivially per record.
I think option 3 is feasible on a ~6-month time scale.
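To give a feel for option 3, here is a minimal, purely illustrative sketch of the per-record fit, assuming scikit-learn and a localized (gene, ndarray) record layout. None of these names are real Hail API; the conversion step that would produce these records is exactly the missing piece described above:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def fit_block(gene, data, y):
    """Regularized fit for one (gene, data) record; `data` is an
    (n_samples x n_snps) genotype ndarray localized from the MatrixTable."""
    model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(data, y)
    return gene, model.coef_

# Stand-in for the proposed Table of (gene, ndarray) records; in reality these
# would come from the MatrixTable -> Table conversion sketched in option 3.
records = [('GENE1', np.random.rand(500, 200)),
           ('GENE2', np.random.rand(500, 350))]
y = np.random.rand(500)

results = [fit_block(gene, data, y) for gene, data in records]
```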