TableJoin Tutorial: Exercises


#1

Trying to get started following the 0.2 tutorials. This references the Table Join exercise using the movie example data to answer the question “What genres are rated most differently by males and females?”
https://hail.is/docs/0.2/tutorials/08-joins.html#Exercises

  1. My code is below. Is there a more efficient solution to this problem? Is there a way to avoid creating the two new tables?

  2. Hail seems to be notifying me that it is sorting the data multiple times. Why? Is there a way to avoid this?

  3. What prerequisite knowledge in distributed computing should I have? Can anyone suggest additional resources for learning patterns for more complex queries?


#2

The message about “ordering unsorted dataset with network shuffle” means that Hail is doing a distributed sort. It will do this in most cases when you call .group_by or .key_by, or when you join two tables along a non-key field. For people with the biggest data, these messages may indicate that a pipeline will be in trouble: the infrastructure we rely on for the distributed sort is unreliable and loves to blow memory. We’re working on developing our own infrastructure to replace this.

Your answer above is totally correct, but there are much more efficient ways to write it. Here is one, which replaces everything from the line with t = t.explode down to the one where you annotate rating_diff.

t = t.group_by(t.genres).aggregate(
    rating_M = hl.agg.filter(t.sex == 'M', hl.agg.mean(t.rating)),
    rating_F = hl.agg.filter(t.sex == 'F', hl.agg.mean(t.rating)))
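
If it helps to see why this needs only one grouping step instead of two tables and a join, here is a rough plain-Python sketch of the same idea. It is not Hail code and says nothing about how Hail runs the query internally; the toy rows and the helper name mean_ratings_by_genre are made up for illustration. Each row is visited once, and its rating goes into the M or F accumulator for its genre.

```python
from collections import defaultdict

# Hypothetical toy rows standing in for the exploded ratings table
# (one row per rating per genre). Not the real tutorial data.
rows = [
    {"genres": "Action", "sex": "M", "rating": 4},
    {"genres": "Action", "sex": "F", "rating": 3},
    {"genres": "Action", "sex": "M", "rating": 5},
    {"genres": "Drama", "sex": "F", "rating": 5},
    {"genres": "Drama", "sex": "M", "rating": 3},
    {"genres": "Drama", "sex": "F", "rating": 4},
]


def mean_ratings_by_genre(rows):
    """Single pass: per genre, keep a [sum, count] pair for each sex."""
    acc = defaultdict(lambda: {"M": [0, 0], "F": [0, 0]})
    for r in rows:
        by_sex = acc[r["genres"]]
        if r["sex"] in by_sex:  # plays the role of the two filtered aggregators
            by_sex[r["sex"]][0] += r["rating"]
            by_sex[r["sex"]][1] += 1

    result = {}
    for genre, by_sex in acc.items():
        means = {
            sex: (total / count if count else None)
            for sex, (total, count) in by_sex.items()
        }
        diff = None
        if means["M"] is not None and means["F"] is not None:
            diff = means["M"] - means["F"]
        result[genre] = {
            "rating_M": means["M"],
            "rating_F": means["F"],
            "rating_diff": diff,
        }
    return result


result = mean_ratings_by_genre(rows)
for genre, stats in sorted(result.items()):
    print(genre, stats)
```

Both means for a genre come out of the same pass, so there is nothing left to join afterwards; the difference is just arithmetic on two fields of the same row.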

#3

We want to build a system that takes care of all the distribution for you, so that you don’t have to worry about it at all!

We also don’t really have resources for learning complex queries, though some of them appear in the how-to guide. This is something I want to create, though.