Web API projects

Hi all,
Do you know of any projects developing a web API layer for Hail? Open GitHub projects preferred of course :wink:

Yep! It’s in progress, and my focus. If you have any specific feature requests pertaining to the app, I’m very interested!


Hi Alex,
I absolutely have some input. Would you like me to share on this forum or would you prefer that I message you directly?



We’d like to keep everything public if possible - the whole community benefits that way!


Yes, as Tim suggests, please share with everyone. Maybe, to help with organization, tag it #web

Tim, it seems I can’t make categories yet. Could we make a category for web efforts? (#web or something similar). It seems distinct enough from other aspects of Hail development to maybe warrant separation.

Alex, I think I gave you privileges – can you try again?

If it’s a development chat, we may want to move it to http://dev.hail.is, though

Well here goes then :slight_smile:

  • The import / export data functions are the most important to me; there is obviously a need to import metadata as well. I’m not sure whether the best format for the latter is TSV, CSV, or JSON.
    – Import of VCF and export in PLINK is most important. Capacity to import Nirvana-annotated JSON files would be awesome.
  • The second most important function for me, once data is in the system, is the capacity to generate PCA plots.
  • Then I would like to see filtering functions, based on metadata or specific criteria such as the ones in the “Let’s do a GWAS” tutorial/example.

These are the basic functionalities I would love to see in a web API layer.
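To make the wish list above concrete, here is a rough sketch of what request payloads for such endpoints might look like. Every endpoint path and field name here is purely hypothetical, chosen just to illustrate the shape of the API; none of it is an existing Hail interface.

```python
import json

# Hypothetical request payloads for the endpoints described above:
# import (VCF plus metadata), export (PLINK), and PCA. The paths and
# field names are invented for illustration only.
import_request = {
    "endpoint": "/datasets/import",
    "format": "vcf",                      # or "json" for Nirvana-annotated JSON
    "path": "gs://my-bucket/cohort.vcf.bgz",
    "metadata": {"format": "tsv", "path": "gs://my-bucket/pheno.tsv"},
}

export_request = {
    "endpoint": "/datasets/cohort1/export",
    "format": "plink",
    "output": "gs://my-bucket/cohort1-plink",
}

pca_request = {
    "endpoint": "/datasets/cohort1/pca",
    "k": 5,                               # number of principal components
}

# Each payload serializes cleanly to JSON for transport.
wire = json.dumps(pca_request)
decoded = json.loads(wire)
```

The point is only that each of the functions listed above maps naturally onto a small JSON payload; the actual endpoint design would be up to whoever builds the layer.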

Thank you so much for hearing me out!

Thanks for the post! Can you share a bit more about the users you’re envisioning for this kind of system? It sounds like this could make it possible for people without as much programming experience to analyze big genetic data.

Hi Tim and Alex,
I’m not certain if that really is the intent or if I really understood the project that Alex is working on.

To be clear, I’m just looking for REST API endpoints to help perform the tasks I listed - not a web interface for doing so…

So, out of the box, there will still be a need for programming skills to make the APIs useful to the non-programmer.

Does that make sense?



Something we are actively working on (but somewhat separately from Alex’s current pull request) is totally abstracting Hail’s backend and front end. The backend will be a system that executes serialized json-like queries and returns data to whatever front-end is contacting it.
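As a toy illustration of that model, a serialized query might look something like the following. The query shape and operation names are made up for illustration; the real Hail backend protocol may differ entirely.

```python
import json

# A made-up example of a serialized, JSON-like query that a frontend
# might send to a headless backend. The "op" names are invented; they
# do not correspond to an actual Hail wire protocol.
query = {
    "dataset": "/datasets/foo.mt",
    "ops": [
        {"op": "filter_rows", "expr": "info.AF[0] > 0.01"},
        {"op": "hwe_normalized_pca", "k": 5},
    ],
}

def handle(raw):
    """Sketch of the backend's entry point: decode the query and
    report what would be executed. A real implementation would
    translate each op into Hail calls and return the results."""
    q = json.loads(raw)
    return {"dataset": q["dataset"], "planned_ops": [o["op"] for o in q["ops"]]}

result = handle(json.dumps(query))
```

The key property is that the frontend never touches Hail directly: it only builds and sends data, and any client that can produce JSON can drive the backend.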

We’ll be setting up something like this for Broad internal use as a proof-of-concept in the next year, and it would certainly be possible for you to use a similar system when that’s ready.

The system you’re imagining is that your organization would stand up a running cluster behind a web API, and you’d build various systems against that?

Yes - that’s exactly it; I am looking to run Hail as a headless app and access its querying and reporting functionalities through APIs. The graphical front-end used to structure the API calls should be irrelevant.
If I too am looking to build a proof of concept (read: lite functionality, only the features described in my previous message) of this type of system, what sort of effort do you think would be required? Also, do you think the current version of Hail is mature enough to support this kind of use case?
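To show what “the front-end should be irrelevant” means in practice, here is a minimal sketch of the client side, using only the Python standard library. The host and endpoint path are hypothetical; any tool that can issue HTTP requests could stand in for this.

```python
import json
import urllib.request

# Hypothetical client call against a headless Hail service. The host
# name and endpoint path are made up; the point is that the frontend
# only needs to build an HTTP request, not to know anything about Hail.
payload = json.dumps({"k": 5}).encode("utf-8")
req = urllib.request.Request(
    "http://hail-service.internal/datasets/cohort1/pca",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send the request; omitted here
# since no such service actually exists.
```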

Many thanks again for your expert advice!

Hi guys,
Any thoughts on the hours of effort required to build out the API layer? It would be really helpful for me to have this info before the end of the quarter, 2018…


Separating our frontend and backend entirely will probably be done by end of Q1 2019. But you could set up a flask server running Hail on a tiny dataset in a couple of hours, I’d think…

I still don’t have a great grasp on the set of things you want to use this web API for. Maintaining a running cluster for just a few users would be expensive; there are economies of scale here.

@lasuperclasse, I’m also curious about your use case. Would your users be happy with a JupyterHub notebook hosted on the Spark leader node? They could share a single cluster, and they wouldn’t need anything but a web browser.

It seems like your customers / users are programmers / software services? I think you could write a simple Flask app that wrapped calls to Hail’s Python API, but serializing large VCFs and output CSVs doesn’t seem great.

A small Flask app that loaded a known dataset, ran PCA, and returned the image would be pretty simple:

import hail as hl
from flask import Flask, jsonify

app = Flask(__name__)

# known datasets, keyed by name
datasets = {'foo': '/datasets/foo.mt'}  # ... more datasets

@app.route('/pca/<dataset>')
def pca(dataset):
    fname = datasets.get(dataset)
    if fname is None:
        return 'no such dataset', 404
    mt = hl.read_matrix_table(fname)
    eigenvalues, scores, loadings = hl.hwe_normalized_pca(mt.GT, k=5)
    # return whatever representation suits your client, e.g. the eigenvalues
    return jsonify(eigenvalues)

I don’t know exactly what you want to return, but that’s like 7 lines to a PCA service.


Thanks for getting back to me.
I don’t understand why the dataset needs to be tiny if running Flask to serve up some results.
By tiny, do you mean <100k samples, or more like <100 samples? Isn’t the handicap of using Flask more relevant to the number of requests served rather than the size of the dataset?

I agree with the economies of scale coming into play only when you have a sufficient number of users making the maintenance worthwhile…

I don’t understand why the dataset needs to be tiny if running Flask to serve up some results.

Do you intend that after receiving a request, the server will run Hail on some data? Or just serve some set of static precomputed results?

You mentioned earlier:

I imagine that it is possible to transfer 100,000 whole genomes through Flask and into a networked file system. You can then use your Hail cluster to manipulate that file. However, there are almost certainly better ways to transfer 100,000 whole genomes from the client machine to a networked file system that the Hail cluster can access.

I don’t know much about your environment or use-case, so it is hard to make recommendations. Can you describe a typical user interaction with this service? Do they start by uploading their data to the service or are all the datasets already present in the service? How does a user identify/refer-to a dataset? How does a user describe a variant filter, do they receive an identifier that refers to the filtered dataset, or does every operation take filtering arguments?

It would be more along the lines of the first option: there would be a regular addition of data to the database (say daily or weekly), and various analysis requests would be performed ad hoc.
There would be very few static precomputed results, other than maybe some meta-information or database sizes since the last update.
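The update loop described above could be sketched as a small piece of bookkeeping. Everything here is hypothetical (the manifest structure and function name are invented, and nothing is a Hail API); it only shows the shape of the batch-ingest workflow with a bit of static summary metadata kept alongside.

```python
from datetime import date

# Hypothetical bookkeeping for the workflow described above: data
# batches land on a schedule (daily or weekly), and a small static
# summary (total sample count, batch list) is refreshed at each update.
manifest = {"batches": [], "n_samples": 0}

def register_batch(manifest, path, n_samples, day):
    """Record a newly imported batch and refresh the static summary."""
    manifest["batches"].append({"path": path, "n_samples": n_samples, "date": day})
    manifest["n_samples"] += n_samples
    return manifest

register_batch(manifest, "/incoming/2018-10-01.vcf.bgz", 120, date(2018, 10, 1))
register_batch(manifest, "/incoming/2018-10-08.vcf.bgz", 95, date(2018, 10, 8))
```

Ad-hoc analysis requests would then run against whatever the manifest currently lists, while the summary fields serve as the only precomputed, static results.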