Is it possible to use a matrix table with a database tool?

I am not sure which category this question actually belongs in. What concerns me is that I seem to need to read the matrix table into memory, whereas I would like to query it like a database and get back data by rsid, locus, and so on.

import hail as hl

mtx = hl.read_matrix_table("MTTEST.mt")

It seems I can’t query mtx without reading it first. Only then can I do:

mtx.filter_rows(mtx.locus == hl.locus(contig="5", pos=12114, reference_genome="GRCh37")).rsid.show()

Is it possible to keep the matrix data in a database? And what should I do with big matrices: should I split them into several “parts” and run the same query in a loop over the parts?

Hi @annalisasnow ,

hl.read_matrix_table does not put the matrix table into memory. Most matrix tables are far too large to fit in memory. Instead, it streams through the data one “partition” at a time. A “partition” is a group of rows.
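You can check how the data is laid out without triggering a full read. A minimal sketch, assuming the same MTTEST.mt file from above:

import hail as hl

mtx = hl.read_matrix_table("MTTEST.mt")

# Number of partitions Hail will stream through, one at a time
print(mtx.n_partitions())

# Counting rows streams through the partitions; it does not load the table into memory
print(mtx.count_rows())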

Hail reads only a very small amount of data when you use filter_rows with == on one of the matrix table’s row keys. I would expect your example to read exactly one row’s worth of data from disk.
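For instance, hl.filter_intervals expresses the same kind of key-based lookup, and lets Hail read only the partitions that overlap the interval rather than scanning the whole table. A sketch, again assuming your MTTEST.mt:

import hail as hl

mtx = hl.read_matrix_table("MTTEST.mt")

# A small interval containing position 12114 on contig 5
interval = hl.parse_locus_interval("5:12114-12115", reference_genome="GRCh37")

# Only partitions overlapping the interval are read from disk
hl.filter_intervals(mtx, [interval]).rsid.show()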

Do you experience high latency when executing your example?

Also, it’s important to understand that most Hail methods are lazy. For example, hl.read_matrix_table doesn’t actually read anything into memory; it only begins to build a description of a pipeline. Only when you run a method like mtx.show, which requires a result to be computed, is the entire pipeline compiled and executed. In your example, that pipeline says that only a single locus needs to be read, and the Hail compiler is able to see that and read only a single row from disk, as @danking said.
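To make that concrete, here is a sketch of the same lookup written out step by step (same file and locus as your example); nothing is computed until the final show call:

import hail as hl

mtx = hl.read_matrix_table("MTTEST.mt")  # builds a pipeline description, loads nothing into memory

# Each step below only extends the pipeline description
filtered = mtx.filter_rows(
    mtx.locus == hl.locus(contig="5", pos=12114, reference_genome="GRCh37")
)
rsids = filtered.rsid

# Only now is the pipeline compiled and executed, reading a single row from disk
rsids.show()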


Yes, I guess so, to a certain extent. There’s also some odd behaviour when importing VCF and PLINK files based on the same data: the VCF is tolerated, while the PLINK import has a memory issue. I guess I should start another thread for that.