List of Various Beginner Questions

Hi folks,

I’ve been playing with the Hail platform on and off for a few months and I’m really excited about it.

I have a few questions since I found Zulip too hard to use:

  • Is there some standard back-end database system in use?

  • If I want to update the local image frequently, does that wipe out my stored data?

  • Is there a way to load JSON-formatted output from Nirvana instead of just VCFs?

  • Do you have any info on the dockerized version of Hail?

  • I would like to make my local Hail accessible through APIs by using URLs (can’t think of a better way to describe it at this hour). Is this on the roadmap?

  • Can you share a formula for estimating the data footprint of my test data?

  • Can you suggest a high-throughput way of loading phenotype data from an external source piecemeal, i.e., add a few VCFs and a few phenotype tables, a pair per sample?

  • What would you suggest as the best way to mark variant entries as imputed vs genotyped?

Thank you for your patience!

Daniel

Will address point by point:

Is there some standard back-end database system in use?

Hail distributes computations with Apache Spark. Beyond that, there is no database in use for persistent storage, though we have found that object stores like Google Storage and Amazon S3 work very well with Spark/Hail.

If I want to update the local image frequently, does that wipe out my stored data?

I don’t fully understand. By ‘image’, do you mean the Hail build? If so, then no! For the life of Hail 0.2, your files will remain readable and your pipelines will remain executable. We still encourage you to update frequently to take advantage of bug fixes and performance improvements, as well as new features!

Is there a way to load JSON-formatted output from Nirvana instead of just VCFs?

It is absolutely possible to load data as newline-delimited JSON (one JSON object per line). You can load this with the import_table function. If you try this and get stuck, we are happy to help in a new forum thread.
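As a rough sketch of what that could look like: the first function below rewrites a JSON document into newline-delimited JSON, and the second reads such a file with import_table. Several things here are assumptions rather than facts from this thread: that the annotation file is a single JSON document with a top-level `positions` array (check your own files), the helper names `to_ndjson` and `load_ndjson_with_hail`, and the choice to parse each line with `hl.parse_json` using a `dtype` you supply yourself.

```python
import json


def to_ndjson(document, array_key="positions"):
    """Return one compact JSON object per line for each element of document[array_key].

    `array_key` is an assumption about the input layout, not a documented constant.
    """
    records = document.get(array_key, [])
    # json.dumps escapes tabs and newlines inside strings, so each record stays on one line.
    return "\n".join(json.dumps(record, separators=(",", ":")) for record in records)


def load_ndjson_with_hail(path, dtype):
    """Hedged sketch: read an NDJSON file into a Hail Table (requires Hail to be installed).

    `dtype` is a Hail type describing one record; you have to write it to match your data.
    """
    import hail as hl  # imported lazily so the converter above works without Hail

    # With no header and no raw tabs in the lines, each line lands in a single field, f0.
    ht = hl.import_table(path, no_header=True)
    return ht.select(record=hl.parse_json(ht.f0, dtype))


if __name__ == "__main__":
    example = {
        "header": {"annotator": "example"},
        "positions": [
            {"chromosome": "chr1", "position": 100},
            {"chromosome": "chr2", "position": 200},
        ],
    }
    print(to_ndjson(example))
```

Writing the output of `to_ndjson` to a file gives you something import_table can read line by line; the hardest part in practice is likely to be writing a `dtype` that matches the nested records.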

I would like to make my local Hail accessible through APIs by using URLs (can’t think of a better way to describe it at this hour). Is this on the roadmap?

This really depends on what you want to do. I imagine it’s not hard to build a web service on top of a running Hail cluster, but we don’t have any concrete plans to do this.

Can you share a formula for estimating the data footprint of my test data?

Generally, Hail’s native formats are comparable in size to the equivalent JSON/VCF (though usually slightly smaller), so the size of your input files is a reasonable first estimate.

What would you suggest as the best way to mark variant entries as imputed vs genotyped?

I imagine that you have a list of the directly genotyped variants somewhere – you can join with this list to annotate presence / absence. Again, happy to help with the specifics.
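To make the join idea concrete, here is a minimal plain-Python sketch of the presence / absence annotation. It is not Hail code: the function name `mark_genotyped` and the dictionary field names (`contig`, `position`, `ref`, `alt`) are made up for illustration. The comment inside shows roughly how the same join might be written in Hail, assuming a hypothetical table `genotyped_ht` keyed the same way as the rows of your dataset.

```python
def mark_genotyped(variants, genotyped_keys):
    """Return copies of the variant records with a boolean 'genotyped' flag.

    A variant is flagged True if its (contig, position, ref, alt) key appears in
    the list of directly genotyped variants; everything else is treated as imputed.
    """
    keys = set(genotyped_keys)  # set lookup keeps the join linear in the number of variants
    return [
        {**v, "genotyped": (v["contig"], v["position"], v["ref"], v["alt"]) in keys}
        for v in variants
    ]


# In Hail, the analogous keyed join would look roughly like this
# (genotyped_ht is a hypothetical Table keyed by locus and alleles):
#   mt = mt.annotate_rows(genotyped=hl.is_defined(genotyped_ht[mt.row_key]))

if __name__ == "__main__":
    variants = [
        {"contig": "1", "position": 1000, "ref": "A", "alt": "G"},
        {"contig": "1", "position": 2000, "ref": "C", "alt": "T"},
    ]
    genotyped = [("1", 1000, "A", "G")]
    for row in mark_genotyped(variants, genotyped):
        print(row)
```

Storing the flag per variant, rather than splitting the data into two sets, keeps imputed and genotyped entries in one dataset that you can filter on later.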