Hello Hail team!
I recently started using Hail 0.2 and got stuck on the following problem.
Is there a way to write a Hail MatrixTable to a tab-separated file in wide format?
In more detail: I have a MatrixTable with this structure:
mt.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Column fields:
's': str
----------------------------------------
Row fields:
'gene_id': str
----------------------------------------
Entry fields:
'number': int64
----------------------------------------
Column key: ['s']
Row key: ['gene_id']
----------------------------------------
mt.show()
gene_id | Sample1.number | Sample2.number | Sample3.number | Sample4.number |
---|---|---|---|---|
str | int64 | int64 | int64 | int64 |
"ENSG00000000457" | 9 | 5 | 5 | 5 |
"ENSG00000000460" | 23 | 13 | 11 | 11 |
"ENSG00000000971" | 0 | 4 | 4 | 4 |
"ENSG00000001036" | 3 | 3 | 3 | 3 |
I would like to write the MatrixTable to a file, keeping the same wide format that mt.show() displays.
The problem is that I have around 15k genes and around 500k samples.
I tried mt.make_table(), but after running for 24 hours (on a cluster node with 28 processors) it was only about halfway through.
I also tried ht = mt.entries() followed by ht.export("myfile.tsv.bgz"). After decompressing the bgz file I got a 170 GB file that contains the data I want, but in long format. For example:
gene_id s number
ENSG00000000419 Sample1 5
ENSG00000000419 Sample2 8
ENSG00000000419 Sample3 0
ENSG00000000419 Sample4 3
ENSG00000000419 Sample5 23
ENSG00000000419 Sample6 14
If the file were smaller, I could easily pivot the table using pandas or tidyverse, but I’m afraid 170 GB is too much to load into memory. I could try Spark or another big-data tool, but I keep telling myself there should be a way to do this natively with Hail!
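To make the transformation I am after concrete, here is a minimal sketch of a streaming pivot in plain Python that avoids loading the whole file. It rests on assumptions I have not verified for the Hail export: that all rows for a given gene_id are contiguous, and that the samples appear in the same order for every gene. The function name pivot_long_to_wide and the tiny in-memory input are only illustrative.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def pivot_long_to_wide(long_lines, out):
    """Stream a long-format TSV (gene_id, s, number) into wide format.

    Assumes rows for each gene_id are contiguous and that samples appear in
    the same order for every gene; raises ValueError if the order differs.
    """
    reader = csv.reader(long_lines, delimiter="\t")
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    next(reader)  # skip the header line: gene_id, s, number
    samples = None
    for gene_id, rows in groupby(reader, key=itemgetter(0)):
        rows = list(rows)  # only one gene is held in memory at a time
        current = [r[1] for r in rows]
        if samples is None:
            # The first gene defines the column order of the wide file.
            samples = current
            writer.writerow(["gene_id"] + samples)
        elif current != samples:
            raise ValueError(f"sample order differs for {gene_id}")
        writer.writerow([gene_id] + [r[2] for r in rows])


# Tiny in-memory example; a real run would pass open file handles instead.
long_text = (
    "gene_id\ts\tnumber\n"
    "ENSG00000000419\tSample1\t5\n"
    "ENSG00000000419\tSample2\t8\n"
    "ENSG00000000457\tSample1\t9\n"
    "ENSG00000000457\tSample2\t5\n"
)
buffer = io.StringIO()
pivot_long_to_wide(io.StringIO(long_text), buffer)
wide_text = buffer.getvalue()
print(wide_text)
```

With 500k samples each wide row is still only a few megabytes, so this would be bounded by disk throughput rather than memory, but it remains a workaround outside Hail rather than the native solution I am hoping for.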
Can anyone help me? Thank you!