Group rows by several columns merging values into arrays

Hi,

I read in a csv file that has the following structure:

In [3]: t.describe()                                                                                                                                       
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    '#CHROM': str 
    'POSITION': str 
    'ID': str 
    'DNA_MUT': str 
    'PROT_MUT': str 
    'SNV': str 
    'INDELS': str 
    'ID_REF': str 
    'ID_ALT': str 
    'STRAND': str 
----------------------------------------
Key: []
----------------------------------------

I want to take ‘#CHROM’ and ‘POSITION’, generate an id from them (by concatenation, e.g. CHROM + ‘_’ + POSITION) and group by them merging the remaining values into arrays for unique concatenated #CHROM + POSITION. How to achieve that? The final result that I am looking for is the following:

ID                                data
CHROM + '_' +  POSITION          [[...], [...], ...]

Hi!
When I was learning about hail’s group-by functionality I’ve found the documentation pretty helpful: https://hail.is/docs/0.2/hail.GroupedTable.html#groupedtable :slight_smile:

I’m not a specialist but it’s fun to illustrate few things on You example.

Generate a dummy example which hopefully closely resembles Your t:

ht = hl.utils.range_table(8)
ht = ht.annotate(
    chrom = "chrom" + hl.str(ht.idx // 2 // 2),
    position = "position" + hl.str(ht.idx // 2 % 2),
    A = "A" + hl.str(ht.idx),
    B = "B" + hl.str(ht.idx)
).key_by().drop("idx")
ht.show()

image

I can group by more than one field (without the need to e.g. concatenate those fields as strings). The fields that I group by become keys. For other fields I need to choose an aggregation, let’s go with collect as You request:

ht_grouped = ht.group_by("chrom", "position").aggregate(
    A_collected = hl.agg.collect(
        ht.A
    ),
    B_collected = hl.agg.collect(
        ht.B
    )
)
ht_grouped.show()

image

As an aside, it might be useful to notice that I could’ve avoided explicitly typing in the names of fields that I’m aggregating since they are accessible by ht.row. E.g. the following would result in the same grouped table as above:

ht_ = ht.key_by("chrom", "position")

ht_grouped_ = ht_.group_by(*ht_.key).aggregate(**{
    field + "_collected": hl.agg.collect(ht_[field])\
    for field in ht_.row if field not in ht_.key
})
assert(ht_grouped_.collect() == ht_grouped.collect())

At this point further manipulations are a matter of my fantasy. Going with Your question I can construct new fields by concatenation of the old ones, and drop the latter:

ht_grouped_2 = ht_grouped.annotate(
    ID = ht_grouped.chrom + "_" + ht_grouped.position,
    data = [ht_grouped.A_collected, ht_grouped.B_collected]
).key_by(
    "ID"
).drop(
    "chrom", "position",
    "A_collected", "B_collected"
)
ht_grouped_2.show()

image

Notice that I first needed to change the key from [chrom, position] to ID in order to drop chrom and position.

2 Likes