Read dbNSFP file

Hey,
I’m totally new to Hail, but I hope it does what I need.
I’ve downloaded dbNSFP and, just as a test, I executed:

table = hl.import_table('/DATABASES/dbNSFP40/dbNSFP4.0a_variant.chr9', missing=".", delimiter='\t', impute=True)

then:
table.show()

at is.hail.utils.richUtils.RichIterable.foreachBetween(RichIterable.scala:12)
at is.hail.expr.ir.Pretty$.is$hail$expr$ir$Pretty$$pretty$1(Pretty.scala:408)
at is.hail.expr.ir.Pretty$$anonfun$is$hail$expr$ir$Pretty$$pretty$1$6.apply(Pretty.scala:408)
at is.hail.expr.ir.Pretty$$anonfun$is$hail$expr$ir$Pretty$$pretty$1$6.apply(Pretty.scala:408)
at is.hail.utils.richUtils.RichIterator$.foreachBetween$extension(RichIterator.scala:35)
at is.hail.utils.richUtils.RichIterable.foreachBetween(RichIterable.scala:12)
at is.hail.expr.ir.Pretty$.is$hail$expr$ir$Pretty$$pretty$1(Pretty.scala:408)
at is.hail.expr.ir.Pretty$$anonfun$is$hail$expr$ir$Pretty$$pretty$1$6.apply(Pretty.scala:408)
at is.hail.expr.ir.Pretty$$anonfun$is$hail$expr$ir$Pretty$$pretty$1$6.apply(Pretty.scala:408)
at is.hail.utils.richUtils.RichIterator$.foreachBetween$extension(RichIterator.scala:35)



Hail version: 0.2.28-61941242c15d
Error summary: StackOverflowError: null

What is wrong?

In [62]: table.count()
[Stage 42:=====================================================>(207 + 3) / 210]
Out[62]: 3388601

P.S. table.count() took about 10 seconds; that’s too slow :frowning:

Ah, sorry about this. A few questions:

  1. Is this the full pipeline that fails, just the import and show?

  2. count() is not instant because Hail executes the import to do the count. In order to process datasets too large to fit in memory, Hail streams through data, meaning that it lazily executes pipelines, possibly importing data from disk multiple times (if you count and then show, for instance); see the sketch after this list. How large is the dbNSFP data, and what is its schema?
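
A minimal sketch of that laziness (the file name here is made up):

import hail as hl

ht = hl.import_table('annotations.tsv', impute=True)  # builds a query plan; nothing is read yet
ht.count()  # executes the import: first full pass over the file
ht.show()   # executes the import again: second full pass over the file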

Hi,

  1. Yes, that’s all you need to provoke this error.
  2. dbNSFP is just a big .tsv file. For example, the relatively small chromosome 9 file has 3,388,602 rows × 376 columns.

You can easily check this yourself:
bash:

wget ftp://dbnsfp:dbnsfp@dbnsfp.softgenetics.com/dbNSFP4.0a.zip
unzip dbNSFP4.0a.zip
gunzip dbNSFP4.0a_variant.chr9.gz

then in python:

import hail as hl
hl.init()
table = hl.import_table('dbNSFP4.0a_variant.chr9', missing=".", delimiter='\t', impute=True)
table.show()

P.S. My server has 125 GB of RAM.

OK, will try to replicate!

I don’t think this is related to the file size, though.

This is a large download – would you mind attaching the first 10 lines of the file (gzipped is fine) here? I think I should be able to replicate with that.

Yeah, the main zip is 25 GB :slight_smile:
Here is the head:
chr9_head.txt (132.5 KB)

Thanks, I can replicate this. I think show() will probably fail for any table of this width. We’ll fix that, sorry!
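
In the meantime, a possible workaround is to show only a handful of fields so the rendered table stays narrow (the field names here are assumptions; check table.describe() for the real ones):

table.select('ref', 'alt').show()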

Hmm, as I understand it, gnomAD has many more columns, so it’s strange that this error wasn’t discovered before :slightly_smiling_face:

P.S. A question a bit off topic: I currently use an SQLite3 database with one table of ~100 million rows × 6 columns and another table of ~11 million rows × ~400 columns.

They contain all the annotations I need. The indexed key column is a coord column “chr_start_ref_alt” (1_123133_A_G). Querying these tables takes no time, but I’m worried about future performance as the tables grow.
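
For reference, a minimal sketch of that lookup pattern (the database name is made up; the index is what makes point queries fast):

import sqlite3

con = sqlite3.connect('annotations.db')
# one-time B-tree index on the coord key
con.execute('CREATE INDEX IF NOT EXISTS idx_coord ON anns_table (coord)')
row = con.execute('SELECT * FROM anns_table WHERE coord = ?',
                  ('1_123133_A_G',)).fetchone()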

I hoped to store such data in Hail, but what I see so far (and I’m very new to Hail) is that querying Hail is slower :(

For example, this takes 0.001 s in SQLite3:

select * from anns_table where coord = 'X_107869475_G_A';  -- table has 11 million rows

and this takes ~0.9 s:

mt.entry.take(1)  # this "mt" is created from a VCF with 120k variants…

Seems like Hail is slower :frowning:

Hail definitely will have higher latency for single-row queries than sqlite. If you’re building an application that executes single-row queries in response to user input, for example, then Hail isn’t the right choice of backend. Hail is designed for efficient batch processing, and is probably a good choice for doing ETL before dumping data into a low-latency system like sqlite or Elasticsearch.
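
As a sketch of that ETL pattern (the file names and the some_score field are invented for illustration):

import hail as hl

ht = hl.import_table('annotations.tsv', missing='.', impute=True)
ht = ht.filter(ht.some_score > 0.9)       # hypothetical batch transformation
ht.export('filtered_annotations.tsv.gz')  # bulk output to load into sqlite or Elasticsearch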

What was the full Hail pipeline before the entry.take(1)? Was it just read_matrix_table from disk? I’d like to see this query take 10-100 ms, but there’s not an obvious immediate path to 1ms.

Well, I’m not running any “advanced tests”, just getting a feel for the data.
SQLite3 database:

sqlite> select count(*) from patients_regions ;
count(*)
108642855
Run Time: real 1.096 user 0.183413 sys 0.910287

So my SQLite3 table has 108,642,855 rows.

Now take some data for a patient near the bottom of the table:

sqlite> select * from patients_regions where patient_id = 1087 limit 1;
108504348|1087|1|201|44|288|25|19|1|0.113|0.003
Run Time: real 0.000 user 0.000218 sys 0.000256

It took ~0.000 s.

Now some Hail experiments: I load a VCF with 116,656 variants:

hl.import_vcf('some_final.vcf').write('final.mt', overwrite=True)
mt = hl.read_matrix_table('final.mt')

then I do:

start = time.time() 
mt.row_key.show(1) 
end = time.time() 
print(end - start) 

I get:

+---------------+------------+
| locus         | alleles    |
+---------------+------------+
| locus<GRCh37> | array<str> |
+---------------+------------+
| 1:13116       | ["T","G"]  |
+---------------+------------+
showing top 1 row
0.5431049156188965

Some other timings:
0.24679160118103027
0.2239077091217041

So it’s slower in Hail to take the first record from 100k records
than it is in SQLite3 to take a record near the bottom of 100M records.
(I actually don’t know yet how to filter the n-th record from a Hail matrix table, but I suspect that taking the n-th record from a 100M-row Hail table would be slower than taking it from SQLite3.) :slight_smile:

(Yes, I’m building an app that displays single-row query results in response to user input, so your clarifications were very helpful, thanks!)

You’re comparing apples and oranges [1], but even if we fixed the experiment to properly compare sqlite3 and Hail, sqlite3 would outperform Hail for this operation.

sqlite3 is designed for excellent single machine performance. Hail is designed to (not so excellently) use more than one machine. This design has a cost.


[1] Your schema appears to use a single row for each variant-sample pair whereas Hail uses a single row for all the samples in one variant. When Hail shows one row, Hail is showing the equivalent of many rows of your SQL table.
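
For a closer comparison you can flatten the MatrixTable to one row per variant-sample pair, matching your SQL schema (a sketch; this materializes a much longer table):

et = mt.entries()  # one row per (variant, sample) pair
et.count()         # 116656 variants times the number of samples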

Yeah, I know I’m comparing the wrong things, but I can see that a simple Hail operation already takes some time.
I just want to build a robust data source that I can query quickly and eventually run some simple operations on.

gnomAD genomes v2.1.1 has 261,942,336 distinct variants,
and for every variant I can generate ~400 annotations, so an SQL table of size 261,942,336 × 400 would be problematic.