Hi,
Is there any way to chop the whole Hail table into smaller chunks?
The Hail table at gs://gcp-public-data--gnomad/release/2.1/ht/exomes has more than 17 million rows, and I can only process chunks of, for example, 20,000 rows at a time. But I cannot find a way to slice through the table: head and tail don't let me specify the start and end indexes of the rows I want to extract, and neither does filter.
Hi, thanks for your reply. I mean I want to take a portion of hail table and process it. For another portion, I may process it differently.
Currently, if I want to take, for example, the 501st through the 1000th row, I extract a table like this:
Given that ht is the original Hail table with all the rows,
my portion will be:
portion = ht.head(1000).tail(500)
I am wondering whether there is any way to slice through the Hail table the way you can with a list or a pandas DataFrame in Python.
For example: portion = ht[501:1001]
What is your application after moving this data into Python? Hail is not designed to make this kind of operation (localizing subsets of a table as Python objects) particularly efficient.
I have a gene list that I want to filter the Hail table by. But the table has more than 17 million rows, which I cannot filter in one go, and processing the whole table takes a lot of time. That's why I would like to split it into, let's say, 12 parts and process each part.
The expected end result is a Hail table with only the gene_symbol values of interest.
The gene_symbol is a row sub-field belonging to vep.transcript_consequences.
Just to confirm: you have a gene list (say, a list of gene-name strings in Python), and you want to filter the table to rows where ht.vep.transcript_consequences[...].gene_symbol is in that list.
If so, it’s definitely possible to express this as a transformation within Hail. If you could process the data in 12 parts, what exactly would you do? Pull it into pandas, filter, move it back into Hail?
Yes. I want to filter the table to rows where the ht.vep.transcript_consequences[...].gene_symbol is in that list.
What kind of transformation within Hail do you mean? After I process the data in 12 parts, I will collect them and put them in our in-house database. Our in-house database will contain more information; the gnomAD data is just a portion of it. Thanks!
You can then export this to a TSV file, if you want, with Table.export. I should caution you that this query necessarily looks at every row of the table. It would be faster if you had a set of genomic intervals instead of gene names; if you did, you could use a query that does work proportional to the size of the intervals:
As above, you can export with Table.export after you do this.
As an aside: I encourage you to consider using Hail for all your analysis. Hail is a general-purpose analytical engine; you should be able to express more or less any SQL-like query with the Hail library, and you get the benefit of not needing to import the data into a database engine.