Log of breaking changes in 0.2 beta


This thread documents breaking changes as we work toward stabilizing the development branch (i.e., master, 0.2 beta) as Hail 0.2 proper.

Clarification of `agg.sum` behavior

Removed as_array parameter from PCA

pca and hwe_normalized_pca no longer take an as_array parameter. They now always return scores and loadings as arrays (formerly the as_array=True option).

See the overview tutorial for example usage in GWAS, where PC1 becomes scores[0].
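
As a rough sketch of what the array form looks like in practice, the helper below (a hypothetical name, `annotate_pc_scores`) runs hwe_normalized_pca and attaches the scores array to the columns, so that PC1 is read as scores[0]. It assumes the call field is named GT, that the method is called with a call expression and k, that it returns (eigenvalues, scores, loadings), and that the scores table can be joined on the column key. Check each of these against your Hail version.

```python
def annotate_pc_scores(mt, k=10):
    """Attach PCA scores to the columns of `mt` as an array field `scores`.

    Assumptions (not stated in the changelog): the genotype call field is
    `mt.GT`, and `hwe_normalized_pca` returns (eigenvalues, scores, loadings).
    """
    # Imported here so this module can be loaded without initializing Hail.
    import hail as hl

    eigenvalues, scores, _loadings = hl.hwe_normalized_pca(mt.GT, k=k)
    # The `scores` field of the scores table is an array, so PC1 is
    # scores[0], PC2 is scores[1], and so on.
    mt = mt.annotate_cols(scores=scores[mt.col_key].scores)
    return eigenvalues, mt
```

With this in place, a covariate formerly written as PC1 would be written as mt.scores[0].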


Removed dataset parameter from eight methods

All methods that took a dataset and at least one required expression on that dataset no longer take a dataset parameter at all (the dataset is implicitly the source of the expression):



Changed ys to y and schema in linear regression

Consistent with the other statistics methods, the parameter ys on linear_regression is now y. When y is a single expression, the linreg fields all have type float64, which matches the other regression methods.

When y is a list of expressions (even a list of one expression) the behavior is the same as before: the five y-dependent linreg fields have type array[float64].

The field n_complete_samples is now just n.

See the overview tutorial for example usage of the case where y is an expression. In particular, linear_regression_results.linreg.p_value[0].collect() becomes linear_regression_results.linreg.p_value.collect(), with no [0] index.




Oops. See:


ld_prune has changed to take a CallExpression instead of a matrix table. The new signature is ld_prune(call_expr, r2=0.2, window=1000000, memory_per_core=256).

See: https://github.com/hail-is/hail/pull/3518


ld_prune no longer requires unphased genotypes (though it still makes no use of phasing information). The parameter window has also been renamed to bp_window_size.

See: https://github.com/hail-is/hail/pull/3575


While we’re at it, ld_prune now also returns a Table with just the fields ('locus', 'alleles'), which is the set of independent variants at that threshold (previously it returned the MatrixTable filtered to that set).
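
Since the result is now a Table of variants rather than a filtered MatrixTable, code that relied on the old return value needs an explicit filter step. The sketch below shows one way to do that. The helper name `filter_to_pruned_variants` is hypothetical, and it assumes the call field is named GT and that the MatrixTable rows are keyed by locus and alleles.

```python
def filter_to_pruned_variants(mt, r2=0.2, bp_window_size=1_000_000):
    """Return `mt` restricted to the variants kept by ld_prune.

    Assumption: the genotype call field is `mt.GT`.
    """
    # Imported here so this module can be loaded without initializing Hail.
    import hail as hl

    # ld_prune takes a call expression and returns a table of the
    # independent variants, keyed by locus and alleles.
    pruned = hl.ld_prune(mt.GT, r2=r2, bp_window_size=bp_window_size)
    # Keep only the rows whose key appears in the pruned table.
    return mt.filter_rows(hl.is_defined(pruned[mt.row_key]))
```

This recovers the previous behavior in two steps, while leaving the pruned variant table available for reuse on other datasets.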

Minor breaking change: hl.min_rep() now returns a struct with fields locus (a LocusExpression) and alleles (an ArrayExpression of type str). This makes it much easier to apply min_rep and re-key, as in:

mt = mt.key_rows_by(**hl.min_rep(mt.locus, mt.alleles))


Minor change: the parameter names of hl.rand_unif(min, max) are changing to lower and upper.
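
For scripts written against earlier betas, the keyword renames recorded in this thread can be applied mechanically. The helper below is a hypothetical sketch and not part of Hail: it only rewrites keyword names (ys to y, window to bp_window_size, min and max to lower and upper) and does not handle removed parameters such as as_array or dataset.

```python
# Keyword renames taken from the changelog entries in this thread.
# The function names are used only as dictionary keys; nothing here calls Hail.
RENAMED_KWARGS = {
    "linear_regression": {"ys": "y"},
    "ld_prune": {"window": "bp_window_size"},
    "rand_unif": {"min": "lower", "max": "upper"},
}


def migrate_kwargs(func_name, kwargs):
    """Return a copy of `kwargs` with old keyword names replaced by new ones."""
    renames = RENAMED_KWARGS.get(func_name, {})
    migrated = {}
    for name, value in kwargs.items():
        new_name = renames.get(name, name)
        if new_name in migrated:
            # Both the old and the new spelling were supplied.
            raise TypeError(f"{func_name}: duplicate argument {new_name!r}")
        migrated[new_name] = value
    return migrated


# Example: an old-style ld_prune call site using `window`.
print(migrate_kwargs("ld_prune", {"r2": 0.2, "window": 500_000}))
```

Names that are not in the table pass through unchanged, so the helper can be applied to any call site without first checking whether it is affected.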