Visualization and analytics frontend

Hi!
I’d like to ask for a recommendation, as I’m not a very experienced developer. I apologize if this is not the right place for this.

Suppose I have a smallish VCF and dataframe-like phenotype data stored in Postgres (ca. 5k people x (WES + 1k phenotypes)). I would like to develop a dashboard-like web app with a simple, nice-looking interface. The purpose of the app would be to enable simple exploration of the data after selecting/filtering/grouping/aggregating both vertically and horizontally, and to present the result as an explicit table, barplots, histograms etc., with as small a time delay as possible. Example queries would be: 1) total number of deletions in chrX of a particular individual, 2) list of all variants from a predefined list that a given individual has, 3) mean number of individuals with a specific genotype at a specific locus. I imagine some aggregate statistics could be precomputed, but most queries would need to be computed ad hoc. Additionally, performing GWAS-like analyses (hypothesis testing, dimensionality reduction; some general ML would be nice) should be possible on the backend, not necessarily in real time, and their results would be presented in the app as well. The data does not change other than being incrementally updated once a week. The app would serve the results to anyone with Internet access (and appropriate credentials).

Hail seems like a great tool for importing and manipulating the genotypic data, running GWAS, and exporting the data to Spark. On the frontend, something like the Python library Dash seems appropriate.

Given the above lengthy confession, here are my questions:
Do you have any recommendations? Is this even doable? What technologies should I use? Should I put the VCF in a database? If I understand correctly, Hail’s native data format can also be used to persist the data on disk and query it in a lazy manner? Additionally, I’ve been instructed to look into Parquet and Cassandra. Any suggestions or reactions from the Hail community will be appreciated!

You might contact the gnomAD team. They have investigated using Hail as a backend. There are some issues that the Hail team needs to sort out in the next several months before this is feasible.

You’ll have a bad time with Parquet because it has practical limits on the number of columns. It won’t be able to represent modern high-throughput sequencing datasets in a row-major way (one row per variant, one column per sample). You’d have to represent them point-wise (one row per sample-variant pair), which would severely impact memory use and performance.
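To make the point-wise layout concrete, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for Postgres. The table name, column names, and toy rows are all hypothetical, and "deletion" is simplified to "REF allele longer than ALT allele". It stores one row per (sample, variant) genotype call, so the row count grows as samples x variants, and then runs the "deletions in chrX of a particular individual" query from the question.

```python
import sqlite3

# In-memory database as a stand-in for Postgres (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE genotype_calls (
        sample_id TEXT,
        contig    TEXT,
        position  INTEGER,
        ref       TEXT,
        alt       TEXT,
        n_alt     INTEGER  -- number of ALT alleles carried: 0, 1 or 2
    )
    """
)

# Toy data: 2 samples x 4 variants = 8 rows in the point-wise layout.
rows = [
    ("S1", "chrX", 100, "ACG", "A", 1),  # deletion, carried by S1
    ("S1", "chrX", 200, "T", "C", 2),    # SNV
    ("S1", "chrX", 300, "GTT", "G", 0),  # deletion, S1 is hom-ref
    ("S1", "chr1", 50, "CA", "C", 1),    # deletion on another contig
    ("S2", "chrX", 100, "ACG", "A", 0),
    ("S2", "chrX", 200, "T", "C", 1),
    ("S2", "chrX", 300, "GTT", "G", 2),  # deletion, carried by S2
    ("S2", "chr1", 50, "CA", "C", 0),
]
conn.executemany("INSERT INTO genotype_calls VALUES (?, ?, ?, ?, ?, ?)", rows)


def count_deletions(conn, sample_id, contig):
    """Count deletions (REF longer than ALT) that a sample carries on a contig."""
    (n,) = conn.execute(
        """
        SELECT COUNT(*) FROM genotype_calls
        WHERE sample_id = ? AND contig = ?
          AND n_alt > 0 AND length(ref) > length(alt)
        """,
        (sample_id, contig),
    ).fetchone()
    return n


total_rows = conn.execute("SELECT COUNT(*) FROM genotype_calls").fetchone()[0]
```

With 5k samples and exome-scale variant counts, the same layout needs a row for every sample at every variant site, which is where the memory and performance cost comes from.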

I haven’t investigated Cassandra. I believe the gnomAD team currently uses Elasticsearch.

Hail’s format indeed can be used to query the data on disk. If you write filter queries that look for one locus, the lookup will only read the data for that locus. Queries that look at one or a few loci should be quite fast, but maybe not web-fast (i.e. <100ms).
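A rough sketch of such a single-locus lookup, assuming the data was previously written as a Hail MatrixTable. The helper names, the `"contig:position"` input format, and the default reference genome are assumptions for illustration. The Hail import is deferred into the function so a web process only starts Hail when a query actually runs.

```python
def parse_locus_string(text):
    """Split 'chrX:12345' into ('chrX', 12345); raise ValueError on bad input."""
    contig, sep, position = text.strip().partition(":")
    if not sep or not contig or not position.isdigit():
        raise ValueError(f"expected 'contig:position', got {text!r}")
    return contig, int(position)


def count_variants_at_locus(mt_path, locus_text, reference_genome="GRCh38"):
    """Count the variant rows stored at a single locus of a MatrixTable."""
    import hail as hl  # deferred: requires a working Hail installation

    contig, position = parse_locus_string(locus_text)
    mt = hl.read_matrix_table(mt_path)

    # Filter on the row key (locus) so the lookup targets one locus.
    target = hl.locus(contig, position, reference_genome=reference_genome)
    at_locus = mt.filter_rows(mt.locus == target)
    return at_locus.count_rows()
```

A call would look like `count_variants_at_locus("cohort.mt", "chrX:12345")`, where `cohort.mt` is a hypothetical path to a table written earlier with `mt.write(...)`.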

We don’t have efficient per-sample queries, so we have to read every sample’s data to read one sample’s data in a given locus range. We’re working on a fix for this, but it’s a year away at the soonest.

Hail also lacks incremental sample or variant addition. We’re thinking about fixes to this, but again, at least a year away.

Everything you’re asking for is a feature we’re very interested in developing. Hail as a web backend for genetics is an explicit goal of ours. Unfortunately, you’re about a year too early :confused:

Many thanks for the insight!
I’ve also been looking into TileDB for storing and querying VCFs: https://docs.tiledb.com/genomics/
I’ll be sure to share my adventures with this project on the forum :slight_smile:
