Availability of files in hail-data?

hammer · January 13, 2020, 5:32pm

We’re trying to identify places where Hail (and Glow) really lag PLINK in performance, and in the process, have been seeking example data sets to see how performance scales up.

It looks like y’all have a nice collection of test data under hail-data, including profile225.vds which seems to be most commonly used to test performance at scale.

Is it possible for other developers to access a subset of the data in this folder that does not require data use agreements?

tpoterba · January 13, 2020, 9:38pm

All of this data should be public: the profile* datasets are just differently sized chunks of the low-depth 1000 Genomes release. It’s also quite easy to use Hail to simulate genotype data (though not with linkage structure at the moment).
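
A minimal sketch of such a simulation, assuming Hail 0.2 with a working Java runtime and its `hl.balding_nichols_model` function; the original snippet is not shown here. `simulate_genotypes` and `scaling_grid` are hypothetical helper names, and the sizes are arbitrary placeholders rather than the sizes of the profile* datasets:

```python
# Sketch only: assumes Hail 0.2 is installed (pip install hail).

def simulate_genotypes(n_populations=3, n_samples=1000, n_variants=10000,
                       out_prefix=None):
    """Simulate unlinked genotypes with Hail's Balding-Nichols model.

    If out_prefix is given, also export PLINK .bed/.bim/.fam files so the
    same data can be fed to PLINK for a side-by-side timing.
    """
    import hail as hl  # imported lazily so this module loads without Hail

    mt = hl.balding_nichols_model(n_populations, n_samples, n_variants)
    if out_prefix is not None:
        hl.export_plink(mt, out_prefix)
    return mt


def scaling_grid(base_samples=1000, base_variants=10000, steps=3, factor=10):
    """Hypothetical helper: (n_samples, n_variants) pairs for a scale-up run."""
    return [(base_samples, base_variants * factor ** i) for i in range(steps)]


if __name__ == "__main__":
    for n_samples, n_variants in scaling_grid():
        mt = simulate_genotypes(n_samples=n_samples, n_variants=n_variants)
        print(mt.count())  # (number of variants, number of samples)
```

Because the simulated variants carry no linkage structure, a dataset like this is probably better suited to timing per-variant operations than anything that depends on LD.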

I think this is a great project, and we’re painfully aware that many operations in Hail are orders of magnitude slower than PLINK on a single core. There is a path to having roughly comparable performance, but I estimate it’ll take at least a year (probably longer) before we have a code generator that emits vectorized instructions.

Please let us know if there’s any way we can help this effort along!
