I have text file obtained from a variant hail table after explode and eventually addition of annotations.
# chrom pos ref alt transcript_id pdb_id
1 10 A T ENST00001 PDBA
1 10 A T ENST00002 PDBB
1 10 A T ENST00002 PDBC
I wish to load the table in hail as a nested struct
Row fields:
'locus': locus<GRCh38>
'alleles': array<str>
'transcripts': array<struct {
transcript_id: str,
pdbs: array<str>
}>
I succeed to reconstruct locus and allele and do a first group by
# Reconstruct locus and allele and set as key
ht_keyed = ht_tsv.annotate(
locus = hl.locus(ht_tsv.chrom, hl.parse_int(ht_tsv.pos), reference_genome='GRCh38'),
alleles = hl.array([ht_tsv.ref, ht_tsv.alt])
)
ht_keyed = ht_keyed.key_by(ht_keyed.locus, ht_keyed.alleles)
# Aggregate transcript over variant
ht_agg = ht_keyed.group_by('locus', 'alleles').aggregate(
transcripts = hl.agg.collect_as_set(ht_keyed.row.transcript_id)
)
I got
| chr1:10 | ["A","T"] | {"ENST00001","ENST00002"} |
Eventually I would need to group by transcript id the subset of transcripts of the current variant … to obtain something like
locus: chr1:10
alleles: ["A","T"]
transcripts: [
{ transcript_id: "ENST00001", pdbs: ["PDBA"] },
{ transcript_id: "ENST00002", pdbs: ["PDBB", "PDBC"] }
]
I am not sure how I could nest group_by …