Web API projects

Hi guys,
Wishing you first and foremost a merry Christmas and happy and healthy New Year.

By “that file” (just for my own information), do you mean that all the data is stored within a single massive file?

Hopefully these answers are sufficient:
Can you describe a typical user interaction with this service? A user would only be able to perform certain basic analysis steps (such as the ones in “Let’s do a GWAS”) via a web GUI. Hence the utility of having a REST API layer sit between Hail and a UI.

Do they start by uploading their data to the service or are all the datasets already present in the service? No, the data available to the user is what is present in the database at the moment of the query. That being said, fresh data would be added regularly.

How does a user identify/refer-to a dataset? The user would be querying the totality of the dataset that fits the filtering criteria.

How does a user describe a variant filter, do they receive an identifier that refers to the filtered dataset, or does every operation take filtering arguments? I’m not certain I fully understand the question, but I’m leaning towards the latter, where a series of filters is applied to the dataset until the user reaches a set of subjects with acceptable phenotypes and allelic frequencies…

Again, I really appreciate these in-depth questions and the time you’ve taken to respond to it all!
Best,

No. We often refer to them as files, but they are stored as a series of nested folders and physical files called partitions. The size of the partitions is configurable at write time for the Hail format.
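For example, here’s a rough illustration of influencing the partitioning before a write (the path is from the tutorial, the partition counts are arbitrary; check the import_vcf and repartition docs for the exact parameters):

import hail as hl

# Ask for a minimum number of partitions at import time...
mt = hl.import_vcf('data/1kg.vcf.bgz', min_partitions=64)
# ...or explicitly repartition before writing.
mt = mt.repartition(32)
mt.write('data/1kg.mt', overwrite=True)  # written on disk as a folder of partition files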

Your description seems reasonable. It would not take much work to create a Flask app that executes Hail commands in the fashion you’ve described.
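For instance, here is a minimal sketch of that layer, assuming a hypothetical /variant_count endpoint over the tutorial matrix table (the endpoint name, path, and min_af parameter are all illustrative, not part of Hail):

from flask import Flask, jsonify, request
import hail as hl

hl.init()  # starts a local Spark-backed Hail session
app = Flask(__name__)

MT_PATH = 'data/1kg.mt'  # written earlier with hl.import_vcf(...).write(...)

@app.route('/variant_count')
def variant_count():
    # Optional ?min_af= query parameter filtering on alternate allele frequency (hypothetical)
    min_af = float(request.args.get('min_af', 0.0))
    mt = hl.read_matrix_table(MT_PATH)
    mt = hl.variant_qc(mt)
    mt = mt.filter_rows(mt.variant_qc.AF[1] > min_af)
    return jsonify({'n_variants': mt.count_rows()})

if __name__ == '__main__':
    app.run(port=5000)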

I have two more small related questions spurred by your answer and by looking at the tutorial.
In the example:
hl.import_vcf('data/1kg.vcf.bgz').write('data/1kg.mt', overwrite=True)

How would I load multiple VCF files into a collection?
If I have, say, 1M or 10M files, would I still load them all into the same matrix table?

mt = hl.read_matrix_table('data/1kg.mt')

Does this mean the entire matrix table gets read into memory to perform any operations? This might be challenging if my datasets do indeed reach 1M or 10M entries…

See the docs: https://hail.is/docs/0.2/methods/impex.html#hail.methods.import_vcf

It takes a list of VCF files (split by variant).
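For example (file names are hypothetical; each file must contain the same samples and cover a different chromosomal region):

import hail as hl

# Per-chromosome shards of the same cohort (same samples in every file)
vcfs = ['data/cohort.chr1.vcf.bgz',
        'data/cohort.chr2.vcf.bgz',
        'data/cohort.chr3.vcf.bgz']

mt = hl.import_vcf(vcfs)  # a single MatrixTable spanning all the shards
mt.write('data/cohort.mt', overwrite=True)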

Though I can tell you this won’t work with 1M files.

Maybe loading 1M files at once wouldn’t work, but could Hail handle querying that many datasets?
Otherwise, should I think about partitioning the data into multiple matrix tables and searching across those?

Querying data across 1M datasets at the same time? No, that will have problems as well. If you want to have 1M datasets, only a few of which will be queried by each user, then that seems like something you can build into the web server layer.
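As a rough sketch of what that could look like in the web server layer, assuming each dataset was written to its own matrix table and the server keeps a mapping from dataset ID to path (all names here are hypothetical):

import hail as hl

# Hypothetical registry in the web server: dataset ID -> matrix table path.
DATASETS = {
    'cohort_a': 'data/cohort_a.mt',
    'cohort_b': 'data/cohort_b.mt',
}

def count_common_variants(dataset_id, min_af=0.01):
    # Only the requested dataset is ever opened; Hail never touches the others.
    mt = hl.read_matrix_table(DATASETS[dataset_id])
    mt = hl.variant_qc(mt)
    return mt.filter_rows(mt.variant_qc.AF[1] > min_af).count_rows()

That way the 1M-dataset problem stays in the web layer (a lookup table), and each individual Hail query only ever sees one dataset.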

So basically, the limit on the number of datasets that can be queried at once is fixed by the number of datasets you can fit in memory at the same time?
If a WGS VCF output from the DRAGEN germline analysis pipeline is approximately 400 MB per genome (block compressed), and the file footprint in Hail is approximately the same as the compressed VCF size, then I could only analyze 320 genomes at a time on a 128 GB machine (in the optimal scenario)?

The annotated VCF output from GSA-chip genotyping is about 30 MB (vcf.gz, around 700K variants); does this mean I could only analyze about 4,300 samples at a time on the same 128 GB machine?

These types of calculations really help me understand the scalability of the system…
Thanks again.

No, definitely not. If Hail had to fit an entire dataset in memory, then the analysis for gnomAD would have required a machine with hundreds of terabytes of memory. That’s not scalability!

The memory requirements of Hail and Spark are somewhat opaque and vary by operation, but generally don’t scale with data size. The reason I’m worried about 1M VCFs is that there is some overhead per file.
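To make the memory point concrete: read_matrix_table and the filter/aggregate operators build a lazy pipeline that Spark executes partition by partition, so something like the following doesn’t pull the whole dataset onto one machine (the path is hypothetical):

import hail as hl

mt = hl.read_matrix_table('data/big_cohort.mt')  # lazy: only metadata is read here
mt = mt.filter_rows(hl.len(mt.alleles) == 2)     # lazy: keep biallelic sites
print(mt.count_rows())                           # executed partition by partition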

However, the larger problem is that it seems like your data doesn’t fit the form described in the import_vcf docs (files with the same samples, split by chromosomal region): you have one file per sample.

Hail doesn’t support reading data in this orientation at the moment. We do have a gVCF import feature coming in 3-6 months, though.