How to slice a Hail table

Hi,
Is there any way to chop the whole Hail table into smaller chunks?
The Hail table at gs://gcp-public-data--gnomad/release/2.1/ht/exomes has more than 17 million rows, and I can only process chunks of, for example, 20,000 rows. But I cannot find a way to slice the table other than head, tail, or filter, none of which lets me specify the start and end indexes of the rows I want to extract.

Many thanks!

@Duong , what do you mean by

Hail is a so-called “out-of-core” system. It streams through the data chunk by chunk; it never reads all 17 million rows into memory at once.

Hi, thanks for your reply. I mean that I want to take a portion of the Hail table and process it. I may process another portion differently.
Currently, for example, if I want to take the 501st through the 1000th row, I extract a table like this.
Given that ht is the original Hail table with all the rows:

My portion will be
portion = ht.head(1000).tail(500)

I am wondering whether there is any way to slice a Hail table the way you can slice a list or a pandas DataFrame in Python.
For example, I would like to write: portion = ht[501:1001]

Many thanks,
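
To make the row arithmetic in the post above concrete, here is a minimal plain-Python sketch. It is not Hail code: the list `rows` and the helpers `head` and `tail` are hypothetical stand-ins that only mimic "first n" and "last n". It suggests that head(1000).tail(500) selects the 501st through 1000th rows, which in zero-based Python slice notation is [500:1000] rather than [501:1001].

```python
# Plain-Python stand-in for a table: the row at 1-based position i is the integer i.
rows = list(range(1, 2001))


def head(seq, n):
    """Return the first n items (stand-in for a head(n) call)."""
    return seq[:n]


def tail(seq, n):
    """Return the last n items (stand-in for a tail(n) call)."""
    return seq[-n:] if n > 0 else []


# The approach from the post: first 1000 rows, then the last 500 of those.
portion = tail(head(rows, 1000), 500)

# The same rows written as a zero-based, end-exclusive Python slice.
same_as_slice = rows[500:1000]

# The slice written in the post is shifted by one row.
shifted = rows[501:1001]
```

Under these assumptions, `portion` and `same_as_slice` hold the same 500 rows, while `shifted` starts and ends one row later.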

What is your application after moving this data into Python? Hail is not designed to make this kind of operation (localizing subsets of a table as Python objects) particularly efficient.

I have a gene list that I want to use to filter the Hail table. But the Hail data contains more than 17 million rows, which I cannot filter in one go, and processing the whole table takes a lot of time. That’s why I would like to split it into, let’s say, 12 parts and process each part.

The expected end result is a Hail table containing only the rows for the gene_symbol values I care about.

The gene_symbol is a row sub-field belonging to vep.transcript_consequences.

Is this correct?

You have a gene list (say, a list of gene name strings in Python), and you want to filter the table to rows where ht.vep.transcript_consequences[...].gene_symbol is in that list.

If so, it’s definitely possible to express this as a transformation within Hail. If you could process the data in 12 parts, what exactly would you do? Pull it into pandas, filter, move it back into Hail?

Yes. I want to filter the table to rows where the ht.vep.transcript_consequences[...].gene_symbol is in that list.
Which kind of transformation within Hail did you have in mind? After I process the data in 12 parts, I will collect the 12 parts and put them in our in-house database. Our in-house database will contain more information; the data from gnomAD is just one portion of it. Thanks!

If you want to filter to rows in a given set of genes, you can do this:

my_genes = hl.set([...])
ht = ht.annotate(
    genes_in_at_least_one_consequence = hl.set(ht.vep.transcript_consequences.gene_symbol)
)
ht = ht.filter(
    my_genes.intersection(ht.genes_in_at_least_one_consequence).size() >= 1
)

You can then export this to a TSV file, if you want, with Table.export. I should caution you that this query necessarily looks at every row of the table. It would be faster if you had a set of genomic intervals instead of gene names. If you did, you could use this query, which does work proportional to the size of the intervals:

my_intervals = hl.array([hl.parse_locus_interval(...), ...])
ht = ht.filter(
    my_intervals.any(lambda interval: interval.contains(ht.locus))
)

As above, you can export with Table.export after you do this.
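
For readers who want to check the rule the gene-set filter expresses, here is a minimal plain-Python sketch on a few made-up rows. It is not Hail code and does not touch the real table: the row layout, the `id` field, and the names `genes_in_row` and `filter_rows` are assumptions for illustration only. A row is kept when at least one of its transcript consequences names a gene in the list.

```python
# Made-up rows that imitate the nested vep.transcript_consequences field.
rows = [
    {"id": "a", "vep": {"transcript_consequences": [
        {"gene_symbol": "WASH7P"}, {"gene_symbol": "OTHER1"}]}},
    {"id": "b", "vep": {"transcript_consequences": [
        {"gene_symbol": "OTHER2"}]}},
    {"id": "c", "vep": {"transcript_consequences": [
        {"gene_symbol": "DDX11L1"}, {"gene_symbol": "WASH7P"}]}},
    {"id": "d", "vep": {"transcript_consequences": []}},
]

my_genes = {"WASH7P", "DDX11L1"}


def genes_in_row(row):
    """Collect the gene symbols from every transcript consequence of a row."""
    return {tc["gene_symbol"] for tc in row["vep"]["transcript_consequences"]}


def filter_rows(rows, genes):
    """Keep rows whose gene symbols share at least one element with genes."""
    return [row for row in rows if len(genes & genes_in_row(row)) >= 1]


kept = filter_rows(rows, my_genes)
kept_ids = [row["id"] for row in kept]
```

Rows with no consequences, or with consequences only in other genes, drop out; a row that matches several listed genes still appears once.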


As an aside: I encourage you to consider using Hail for all your analysis. Hail is a general-purpose analytical engine. You should be able to express more or less any SQL-like query using the Hail library, and you get the benefit of not needing to import data into a database engine.

Hi danking,
Thanks for your answer and your suggestion using Hail for all my analysis. I will keep it in mind.

However, when I tried your suggested commands with a list of 2 genes:

my_genes = hl.set(["WASH7P", "DDX11L1"])
ht = ht.annotate(
    genes_in_at_least_one_consequence = hl.set(ht.vep.transcript_consequences.gene_symbol)
)
ht = ht.filter(
    my_genes.intersect(ht.genes_in_at_least_one_consequence).size() >= 1
)

And I got this error: AttributeError: 'SetExpression' object has no attribute 'intersect'.
Do you have any suggestions to fix it?

Many thanks.

Ah, sorry, it’s called intersection, not intersect. I’ve fixed my code snippet above.