Revert explode or nested group_by

I have text file obtained from a variant hail table after explode and eventually addition of annotations.

# chrom pos ref alt transcript_id pdb_id
  1     10  A   T   ENST00001     PDBA
  1     10  A   T   ENST00002     PDBB
  1     10  A   T   ENST00002     PDBC

I wish to load the table in hail as a nested struct

Row fields:
    'locus': locus<GRCh38> 
    'alleles': array<str> 
    'transcripts': array<struct {
        transcript_id: str, 
        pdbs: array<str>
    }> 

I succeed to reconstruct locus and allele and do a first group by

# Reconstruct locus and allele and set as key
ht_keyed = ht_tsv.annotate(
    locus = hl.locus(ht_tsv.chrom, hl.parse_int(ht_tsv.pos), reference_genome='GRCh38'),
    alleles = hl.array([ht_tsv.ref, ht_tsv.alt])
)
ht_keyed = ht_keyed.key_by(ht_keyed.locus, ht_keyed.alleles)

# Aggregate transcript over variant
ht_agg = ht_keyed.group_by('locus', 'alleles').aggregate(
        transcripts = hl.agg.collect_as_set(ht_keyed.row.transcript_id) 
)

I got

| chr1:10 | ["A","T"] | {"ENST00001","ENST00002"} |

Eventually I would need to group by transcript id the subset of transcripts of the current variant … to obtain something like

locus: chr1:10
alleles: ["A","T"]
transcripts: [
  { transcript_id: "ENST00001", pdbs: ["PDBA"] },
  { transcript_id: "ENST00002", pdbs: ["PDBB", "PDBC"] }
]

I am not sure how I could nest group_by …

Hi!
How about doing “bottom up” two group-by’s by hand?:

# setup
data = {
    "chrom": ["chr1", "chr1", "chr1"],
    "pos": [1, 1, 1],
    "ref": ['A', 'A', 'A'],
    "alt": ['T', 'T', 'T'],
    "transcript_id": ['ENST00001', 'ENST00002', 'ENST00002'],
    'pdb_id': ["PDBA", "PDBB", "PDBC"]
}
ht = hl.Table.from_pandas(
    pd.DataFrame(data)
)
ht = ht.annotate(
    locus = hl.locus(
        contig = ht.chrom,
        pos = hl.int(ht.pos),
        reference_genome = "GRCh38"
    ),
    alleles = [ht.ref, ht.alt]
).drop(
    "chrom", "pos", "ref", "alt"
).key_by(
    "locus", "alleles"
)
# group-by's
ht = ht.group_by(
    "locus", "alleles","transcript_id"
).aggregate(
    pdbs = hl.agg.collect( # collect_as_set, ... 
        ht.pdb_id
    )
)
ht = ht.group_by(
    "locus", "alleles"
).aggregate(
    transcripts = hl.agg.collect_as_set(
        hl.struct(
            transcript_id = ht.transcript_id,
            pdbs = ht.pdbs
        )
    )
)
ht.show()

image

2 Likes