Do y’all have a target date and/or set of features for the 0.3 release?
Currently there’s no concrete timeline for 0.3. Changes to interfaces (from function renames to larger redesigns) are tagged as 0.3 because we can’t break functionality within 0.2, but we have to balance our desire to improve interfaces with the community burden of rewriting the large number of pipelines and libraries built for 0.2. It took more than a year after the 0.1 => 0.2 version switch until we no longer had frequent questions about 0.1, and I think this would be an even longer process now.
For now, we’re happy developing within the 0.2 version, since the vast majority of our work is on backend infrastructure that is interface-agnostic. We’ve improved performance by 2-2.5x in the last year, with more to come soon. We are building towards distributed linear algebra support, and there’s no hugely compelling reason we need to gate that behind a major version transition.
Thanks for the quick reply @tpoterba!
Any chance y’all could update the For Software Developers docs to include information on how to run the benchmark suite and what’s in it?
I certainly intend to improve documentation and visibility when the benchmarking tools have stabilized (this is a recent project, built over the last few months). One of the complexities, though, is that the benchmarks take around 50-150 CPU hours depending on parameters, and that runs best on the batch execution service used by our CI. Developers unaffiliated with the team will be unable to use this service for the time being.
A somewhat orthogonal goal that may address some of the same concerns is to run the benchmarks on hail:master as a daily cron job, and build a public UI for tracking performance changes over time. This is a medium-priority personal objective, and something I hope to have in a reasonable state by spring/summer 2020.
Although we’re using semver for Hail releases, we’re being stricter than necessary and only plan to increment the minor version when we break backward compatibility. We’re using 0.3 issues to track changes we want to include then. We don’t have a target date for 0.3 (which we will probably make 1.0 to more accurately reflect our release semantics). We’ve tried to make things forward extensible so we can continue to improve Hail and add features without breaking backward compatibility. We’ll break backward compatibility when the benefit of doing so outweighs the pain.
We’ve also talked about deprecating features and parts of the interface to allow a more incremental path forward while warning about breaking changes, but haven’t settled on anything.
In addition to the listed issues, we’ve talked about a few big potential breaking changes:
- BlockMatrix was originally built on Spark’s linalg but has diverged. It will be subsumed by distributed n-d arrays (under development), and we will likely want to drop support for the interface. This is a good candidate for deprecation. Alternatively, we could re-implement the block matrix interface in terms of dndarrays.
- Make the Table and MatrixTable interfaces mutable by default. While it would make chaining easier, closer to what you find in pandas and R, it is a subtle change that is sure to cause lots of confusion. For the moment I think we’ve decided it isn’t worth it.
- Higher-dimensional tables (what we internally call TensorTables). If we do this, we’d probably want to separate the 1-dimensional marginal tables associated with a MatrixTable and add something like an xarray.Dataset to store multiple variables with a common set of dimensions. This would certainly be a major release. We don’t have any concrete work in this direction yet.
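To make the mutability tradeoff in the second bullet concrete, here is a minimal sketch in plain Python. The `MiniTable` class and both method names are hypothetical illustrations, not Hail's actual interface:

```python
# Illustrative sketch (NOT Hail's API): contrasts the current immutable
# style, where every operation returns a new table that must be reassigned,
# with a hypothetical mutable-by-default style.

class MiniTable:
    def __init__(self, fields):
        self.fields = dict(fields)

    # Immutable style: returns a new table; the original is untouched.
    def annotate(self, **new_fields):
        return MiniTable({**self.fields, **new_fields})

    # Hypothetical mutable style: modifies this table in place.
    def annotate_inplace(self, **new_fields):
        self.fields.update(new_fields)
        return self  # returning self still allows chaining

t = MiniTable({"x": 1})
t2 = t.annotate(y=2)     # immutable: t is unchanged, t2 has the new field
assert "y" not in t.fields and t2.fields["y"] == 2

t.annotate_inplace(y=2)  # mutable: t itself now carries y
assert t.fields["y"] == 2
```

The subtlety the bullet alludes to is visible here: with the mutable style, code holding an old reference to `t` silently sees the new field, which is exactly the kind of action-at-a-distance that causes confusion.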
As for performance testing, most of the benchmarks have come from user workloads of interest or performance regressions. Let us know if there are things you’d like to see (or add). We’ve talked about translating relevant parts of TPC (or even adding a SQL front end, which would be awesome), although we don’t currently have the bandwidth. There’s also work to be done comparing Hail performance to other systems (e.g., Spark, PLINK, GATK), which we periodically do for various operations but not systematically.
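For a rough sense of what a per-operation benchmark records, here is a tiny, self-contained timing harness in Python. This is a hypothetical illustration of the general pattern (best-of-N wall-clock timing per named workload), not Hail's actual benchmark code:

```python
import time

# Hypothetical micro-harness: times a named operation several times and
# reports the best and mean wall-clock durations. Real suites run much
# larger workloads (hence the 50-150 CPU hours mentioned above).

def benchmark(name, fn, iterations=3):
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return {"name": name, "best_s": min(times), "mean_s": sum(times) / len(times)}

result = benchmark("sum_squares", lambda: sum(i * i for i in range(100_000)))
assert result["best_s"] <= result["mean_s"]
```

Tracking the best-of-N time per benchmark over successive commits is what would feed the public performance-over-time UI mentioned earlier in the thread.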