How is the number of tasks determined?
The number of tasks is the number of parallel partitions. For a native MatrixTable/Table this is a fixed property of the written data; for VCF/text import it is determined by the block size (set at hl.init()), and for BGEN import by the import parameters.
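As a rough sketch of the relationship (the exact splitting logic is an assumption here, not Hail's actual code): for text/VCF import, the task count is roughly the file size divided by the block size.

```python
import math

def n_tasks(file_size_bytes, block_size_bytes):
    # One task per partition; text/VCF import splits the file
    # into roughly block-size-sized chunks.
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

# e.g. a 10 GiB VCF with a 128 MiB block size -> 80 tasks
print(n_tasks(10 * 1024**3, 128 * 1024**2))
```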
Do you think that inefficiencies creep in when the number of nodes gets closer to the number of tasks?
Yes. If the number of tasks equals the number of cores, the slowest task determines the total runtime. I’d expect exactly these curves in that case (sometimes with longer tails!).
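A minimal simulation of that straggler effect (a toy greedy scheduler, not Spark's actual one): with as many cores as tasks there is a single wave, so the wall time is exactly the slowest task; with fewer cores, tasks pipeline and the tail matters less.

```python
import random

random.seed(0)

def wall_time(task_times, n_cores):
    # Toy greedy schedule: each free core takes the next task.
    cores = [0.0] * n_cores
    for t in task_times:
        i = min(range(n_cores), key=lambda j: cores[j])
        cores[i] += t
    return max(cores)

# Skewed (lognormal) task durations produce long-tailed stragglers.
tasks = [random.lognormvariate(0, 0.5) for _ in range(64)]

# One wave (cores == tasks): the slowest task sets the wall time.
print(wall_time(tasks, 64) == max(tasks))
```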
I’m guessing the logistic regression happens during the fold action?
Yes. And the sortBy, actually. Spark/Hail execution is totally lazy, which can lead to the same work being executed multiple times.
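A small sketch of why laziness causes recomputation, using a plain Python lazy iterator in place of a Spark/Hail pipeline (the analogue of caching would be persisting or checkpointing the intermediate result):

```python
# Count how many times the "expensive" transformation actually runs.
calls = 0

def expensive(x):
    global calls
    calls += 1
    return x * x

data = range(5)
lazy = map(expensive, data)      # nothing runs yet
first_action = sorted(lazy)      # first action triggers computation
lazy = map(expensive, data)      # pipeline rebuilt; a second action...
second_action = sum(lazy)        # ...recomputes every element
print(calls)                     # -> 10: each element computed twice
```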
Also, do you anticipate that the speed of logistic regression will stay the same, or are you planning on working on that area more?
Logistic regression is inherently slower: unlike linear regression, which reduces to a single solve using extremely well-optimized BLAS routines, it requires an iterative fit. We could probably make it ~3x faster by rewriting it in C++, maybe a bit more, but not orders of magnitude. Though I do think there are better ways to hook it into our optimizer.
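To illustrate the gap (a NumPy sketch, not Hail's implementation): linear regression is one BLAS-backed least-squares solve, while logistic regression needs an iterative Newton/IRLS loop, each step of which is itself a weighted linear solve.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([0.5, 1.0, -2.0, 0.25])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

# Linear regression: a single BLAS-backed solve.
beta_lin = np.linalg.lstsq(X, y, rcond=None)[0]

# Logistic regression: Newton / IRLS, one weighted solve per iteration.
beta = np.zeros(X.shape[1])
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))      # predicted probabilities
    w = mu * (1 - mu)                     # IRLS weights
    grad = X.T @ (y - mu)
    hess = X.T @ (w[:, None] * X)
    step = np.linalg.solve(hess, grad)
    beta += step
    if np.max(np.abs(step)) < 1e-8:       # converged
        break

print(np.round(beta, 2))
```

Each Newton iteration costs about as much as the entire linear-regression fit, which is why a C++ rewrite helps by a constant factor but not by orders of magnitude.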