I’m totally new to Hail, but I hope it does what I need.
I’ve downloaded dbNSFP and, just as a test, I’ve executed:
table = hl.import_table('/DATABASES/dbNSFP40/dbNSFP4.0a_variant.chr9', missing=".", delimiter='\t', impute=True)
Hail version: 0.2.28-61941242c15d
Error summary: StackOverflowError: null
What is wrong?
In : table.count()
[Stage 42:=====================================================>(207 + 3) / 210]Out: 3388601
PS: table.count() took about 10 sec; it’s too slow.
Ah, sorry about this. A few questions:
is this the full pipeline that fails? import/show?
count() is not instant because Hail executes the import to do the count. In order to process datasets too large to fit in memory, Hail streams through data, meaning that it lazily executes pipelines, possibly importing data from disk multiple times (if you count and show, for instance). How large is the DBNSFP data, and what is its schema?
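The lazy-execution model described above can be illustrated without Hail at all: a Python generator pipeline behaves the same way, doing no work until a result is demanded and re-scanning the source for each terminal operation. This is an analogy in plain Python, not the Hail API:

```python
# Analogy only (plain Python, not the Hail API): Hail pipelines are lazy,
# like generator pipelines, and each terminal action re-scans the source.
scans = []

def stream_rows():
    """Stand-in for reading a TSV from disk; records each full scan."""
    scans.append(1)
    yield from range(1000)

# Building the pipeline does no work yet (like hl.import_table).
pipeline = (x * 2 for x in stream_rows())
assert scans == []  # nothing has been read so far

count = sum(1 for _ in pipeline)                # forces scan 1 (like .count())
shown = list(x * 2 for x in stream_rows())[:5]  # forces scan 2 (like .show())

print(count, len(scans))  # 1000 2
```

Calling `.count()` and then `.show()` on an imported table therefore reads the file twice, which is why neither call is instant.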
- yes, that’s all you need to provoke this error
dbNSFP is just a big .tsv file. For example, a relatively small chromosome 9 file has 3388602 rows × 376 cols.
you can check it easily by yourself:
then in Python:
import hail as hl
table = hl.import_table('dbNSFP4.0a_variant.chr9', missing=".", delimiter='\t', impute=True)
PS: My server has 125 GB RAM.
OK, will try to replicate!
I don’t think this is related to the file size though
This is a large download – would you mind attaching the first 10 lines of the file (as gz is fine) here? I think I should be able to replicate with that.
yeah, the main zip is 25 GB
here is the head:
chr9_head.txt (132.5 KB)
Thanks. I can totally replicate this – I think probably show() will fail for any table of this width. We’ll fix that, sorry!
Hmm, as I understand it, gnomAD has many more columns, so it is strange that this error was not discovered before.
PS: a question a bit off topic: currently I use an SQLite3 database:
I have one table with ~100 mln rows and 6 columns
the other table with ~11mln rows and ~400 columns
They contain all the annotations I need. The indexed key column is a coord column (e.g. "1_123133_A_G"). Querying these tables takes no time; however, I’m worried about future performance as the tables grow.
I hoped to store such data in Hail, but what I see for now (and I’m very new to Hail) is that querying Hail is slower :(
for example this takes 0.001s in sqlite3:
select * from anns_table where coord = 'X_107869475_G_A';  -- table has 11 mln rows
and this takes ~0.9 s:
mt.entry.take(1)  # this "mt" is created from a VCF with 120k variants…
Seems like Hail is slower.
Hail definitely will have higher latency for single-row queries than sqlite. If you’re building an application that executes single row queries in response to user input, for example, then Hail isn’t the right choice of backend. Hail is designed for efficient batch processing, and is probably a good choice for doing ETL before dumping data into a low-latency system like sqlite or elastic search.
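The ETL-then-serve pattern mentioned above can be sketched with the stdlib `sqlite3` module: the batch tool exports flat rows, they are bulk-loaded, and an index on the lookup key is what makes single-row queries effectively instant. Table and column names here are made up for illustration, and the sample rows are fabricated:

```python
import sqlite3

# In-memory DB for the sketch; in practice this would be a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE anns (coord TEXT, gene TEXT, score REAL)")

# Bulk-load rows produced by the batch/ETL step (fabricated sample data).
rows = [("1_123133_A_G", "GENE1", 0.9),
        ("X_107869475_G_A", "GENE2", 0.1)]
conn.executemany("INSERT INTO anns VALUES (?, ?, ?)", rows)

# The index on the key column is what makes single-row lookups near-instant.
conn.execute("CREATE INDEX idx_coord ON anns(coord)")

hit = conn.execute(
    "SELECT * FROM anns WHERE coord = ?", ("X_107869475_G_A",)
).fetchone()
print(hit)  # ('X_107869475_G_A', 'GENE2', 0.1)
```

This is roughly the division of labor suggested above: Hail for batch transformation, a keyed store like SQLite for low-latency point lookups.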
What was the full Hail pipeline before the entry.take(1)? Was it just read_matrix_table from disk? I’d like to see this query take 10–100 ms, but there’s no obvious immediate path to 1 ms.
Well, I’m not running any "advanced tests", just getting a "feel" for the data.
sqlite> select count(*) from patients_regions ;
Run Time: real 1.096 user 0.183413 sys 0.910287
so my sqlite3 table has 108642855 rows.
now fetch some data for a patient near the bottom of the table:
sqlite> select * from patients_regions where patient_id = 1087 limit 1;
Run Time: real 0.000 user 0.000218 sys 0.000256
it took ~0.000 sec.
now some Hail experiments: I load a VCF with 116656 variants:
mt = hl.read_matrix_table('final.mt')
then I do:
import time

start = time.time()
# ... (the timed Hail call was omitted in the post; the output below suggests a rows().show(1))
end = time.time()
print(end - start)
| locus | alleles |
| locus<GRCh37> | array<str> |
| 1:13116 | ["T","G"] |
showing top 1 row
some other timings:
so it’s slower for Hail to take the first record out of 100k records than for SQLite3 to take a near-last record out of 100M records.
(I actually don’t know yet how to select the n-th record from a Hail matrix table, but I suspect that taking the n-th record from a 100M-row Hail table would be slower than taking it from SQLite3.)
(Yes, I’m building an app that displays single rows in response to user input, so your clarifications were very helpful, thanks!)
You’re comparing apples and oranges, but even if we fixed the experiment to properly compare SQLite3 and Hail, SQLite3 would outperform Hail for this operation.
SQLite3 is designed for excellent single-machine performance. Hail is designed to (not so excellently) use more than one machine. This design has a cost.
Your schema appears to use one row for each variant-sample pair, whereas Hail uses a single row for all the samples at one variant. When Hail shows one row, it is showing the equivalent of many rows of your SQL table.
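The row-granularity difference described above can be sketched without Hail: the SQL table is "long" (one row per variant-sample pair), while a matrix table row is "wide" (one variant carrying every sample). A plain-Python illustration with fabricated variants and samples:

```python
# "Long" layout, like the sqlite table: one row per (variant, sample) pair.
long_rows = [
    ("1_13116_T_G", "sampleA", "0/1"),
    ("1_13116_T_G", "sampleB", "1/1"),
    ("1_13118_A_G", "sampleA", "0/0"),
    ("1_13118_A_G", "sampleB", "0/1"),
]

# "Wide" layout, like a Hail MatrixTable row: one row per variant,
# holding the genotypes of all samples.
wide_rows = {}
for variant, sample, gt in long_rows:
    wide_rows.setdefault(variant, {})[sample] = gt

# Showing one wide row is showing the equivalent of many long rows.
print(wide_rows["1_13116_T_G"])  # {'sampleA': '0/1', 'sampleB': '1/1'}
```

So a single-row `show()` in Hail is doing strictly more work than a single-row `SELECT` against the long table.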
Yeah, I know I’m comparing the wrong things, but I see that even a simple Hail operation already takes some time.
I just want to create a robust data source that I can query quickly and eventually perform some simple operations on.
gnomAD Genomes v2.1.1 has 261942336 distinct variants, and for every variant I can generate ~400 annotations, so an SQL table of size 261942336 × 400 would be problematic.