TableJoin Tutorial: Exercises

cburghard · November 9, 2018, 6:36pm

Trying to get started following the 0.2 tutorials. This references the Table Join exercise using the movie example data to answer the question “What genres are rated most differently by males and females?”
https://hail.is/docs/0.2/tutorials/08-joins.html#Exercises

My code is below. Is there a more efficient solution to this problem? Is there a way to avoid creating the two new tables?
Hail seems to be notifying me that it is sorting the data multiple times. Why? Is there a way to avoid this?
What prerequisite knowledge in distributed computing should I have? Can anyone suggest additional resources for learning patterns for more complex queries?

tpoterba · November 9, 2018, 9:04pm

The message about “ordering unsorted dataset with network shuffle” means that Hail is doing a distributed sort. It will do this in most cases when you .group_by, .key_by, or join two tables along a non-key field. For people with the biggest data, these messages may indicate that a pipeline will be in trouble – the infrastructure we rely on for the distributed sort is unreliable and loves to blow memory. We’re working on developing our own infrastructure to replace this.

Your answer above is totally correct, but there are much more efficient ways to write it. Here is one, which starts from the line with t = t.explode and goes down to the one where you annotate rating_diff.

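# One group_by with two filtered aggregators: mean rating per genre,
# computed separately for each sex, without building two extra tables.
t = t.group_by(t.genres).aggregate(
    rating_M = hl.agg.filter(t.sex == 'M', hl.agg.mean(t.rating)),
    rating_F = hl.agg.filter(t.sex == 'F', hl.agg.mean(t.rating)))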
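
If it helps to see what that single aggregation computes, here is a rough plain-Python sketch (not Hail) of the same idea: one pass over the rows, keeping a separate running mean per genre for each sex. The rows are invented toy data, `rating_means_by_genre` is a hypothetical helper name, and defining `rating_diff` as the male mean minus the female mean is an assumption, since the original code isn't shown here.

```python
from collections import defaultdict

# Toy rows invented for illustration: (genre, sex, rating), one row per
# genre after the genres field has been exploded.
rows = [
    ("Action", "M", 4.0),
    ("Action", "M", 5.0),
    ("Action", "F", 3.0),
    ("Romance", "M", 2.0),
    ("Romance", "F", 4.0),
    ("Romance", "F", 5.0),
]


def rating_means_by_genre(rows):
    """Single pass: per genre, keep a running [sum, count] for each sex."""
    acc = defaultdict(lambda: {"M": [0.0, 0], "F": [0.0, 0]})
    for genre, sex, rating in rows:
        acc[genre][sex][0] += rating
        acc[genre][sex][1] += 1

    def mean(total_and_count):
        total, count = total_and_count
        return total / count if count else None

    result = {}
    for genre, by_sex in acc.items():
        mean_m = mean(by_sex["M"])
        mean_f = mean(by_sex["F"])
        # Assumption: rating_diff is the male mean minus the female mean.
        diff = None if mean_m is None or mean_f is None else mean_m - mean_f
        result[genre] = {
            "rating_M": mean_m,
            "rating_F": mean_f,
            "rating_diff": diff,
        }
    return result


means = rating_means_by_genre(rows)
for genre, stats in sorted(means.items()):
    print(genre, stats)
```

The point of the sketch is only that both means come out of one grouped pass, which is why the Hail version above needs a single group_by rather than two filtered tables and a join.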

tpoterba · November 9, 2018, 9:06pm

We want to build a system that takes care of all the distribution for you, so that you don’t have to worry about it at all!

We also don’t really have resources for learning complex queries, though some of them appear in the how-to guide. This is something I want to create, though.
