Prepare hail entries for spark.ml

Hi,

I’m trying to pull entry data from Hail to build a predictive model with spark.ml. I came up with the following code:

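import hail as hl
mt = hl.read_matrix_table(tmp)  # tmp: path to the matrix table
mt = mt.unfilter_entries()  # restore filtered entries as missing
# per sample (column), collect the alt-allele count of every variant (row)
mt = mt.annotate_cols(g = hl.agg.collect(mt.GT.n_alt_alleles()))
test = mt.cols().select('g').to_spark()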

and this is what I got:
[image]

The problem with this snippet is that hl.agg.collect() doesn’t guarantee the order of the array, which makes it hard to track which feature is which. Does anyone have a solution for this? Would hl.str(mt.locus).collect() work?

Thanks a lot!

The order of hl.agg.collect()'s result here is the row ordering of the tmp matrix table (at least it should be).
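If you would rather not depend on that ordering, one defensive option is to tag each value with its row position before collecting and sort on that tag afterwards. Below is a plain-Python sketch of the idea only, not Hail code: collect_unordered is a hypothetical stand-in for an aggregator with arbitrary output order, and the loci and counts are made-up illustration data. In Hail the analogous step would presumably be collecting a struct of a row index (for example from mt.add_row_index()) together with the allele count and then sorting with hl.sorted, but treat that mapping as an untested assumption.

```python
import random

def collect_unordered(items, seed=0):
    """Hypothetical stand-in for an aggregator that returns items in arbitrary order."""
    out = list(items)
    random.Random(seed).shuffle(out)
    return out

# Made-up data for one sample: (locus, number of alt alleles) per variant
rows = [("1:100", 0), ("1:200", 2), ("2:50", 1), ("2:75", 0)]

# Tag every value with its row index before it goes through the aggregator
tagged = [(idx, locus, n_alt) for idx, (locus, n_alt) in enumerate(rows)]
collected = collect_unordered(tagged)

# Sorting on the tag restores the original row order regardless of collection order
restored = sorted(collected, key=lambda t: t[0])
features = [n_alt for _, _, n_alt in restored]
feature_names = [locus for _, locus, _ in restored]
```

Keeping the locus next to the value also gives you the feature names in the same order as the feature array, which is what you need to map model coefficients back to variants.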

Great, thanks!

Ah, we need to fix the docs!
