We’re trying to identify places where Hail (and Glow) lag PLINK in performance and, in the process, have been looking for example datasets to see how performance scales up.
It looks like y’all have a nice collection of test data under
profile225.vds which seems to be most commonly used to test performance at scale.
Is it possible for other developers to access a subset of the data in this folder that does not require data use agreements?
All of this data should be public – the profile* datasets are just differently sized chunks of the low-depth 1000 Genomes release. It’s also quite easy to use Hail to simulate genotype data (though not with linkage structure at the moment):
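As a rough illustration of what simulating without linkage structure means, here is a minimal, library-independent sketch of a Balding-Nichols-style simulation in plain Python. The function name `simulate_unlinked_genotypes`, its parameters, and the defaults (`fst=0.1`, ancestral frequencies drawn from 0.1 to 0.9) are assumptions made for this example, not Hail's implementation. Every variant is drawn independently of the others, which is exactly why there is no linkage disequilibrium in the output.

```python
import random


def simulate_unlinked_genotypes(n_populations, n_samples, n_variants, fst=0.1, seed=0):
    """Balding-Nichols-style genotype simulation (illustrative sketch only).

    Returns (pops, genotypes) where pops[s] is the population of sample s and
    genotypes[v][s] is the alt-allele count (0, 1 or 2) of sample s at variant v.
    """
    rng = random.Random(seed)
    # Assign each sample to one population, uniformly at random (an assumption).
    pops = [rng.randrange(n_populations) for _ in range(n_samples)]
    genotypes = []
    for _ in range(n_variants):
        # Ancestral allele frequency for this variant (range is an assumption).
        p0 = rng.uniform(0.1, 0.9)
        # Per-population frequencies drift around p0 according to Fst.
        a = p0 * (1 - fst) / fst
        b = (1 - p0) * (1 - fst) / fst
        pop_freqs = [rng.betavariate(a, b) for _ in range(n_populations)]
        # Each diploid genotype is two independent allele draws.
        row = [sum(rng.random() < pop_freqs[k] for _ in range(2)) for k in pops]
        genotypes.append(row)
    return pops, genotypes


# Rough Hail 0.2 counterpart (assumes a Hail 0.2 install, not run here):
#   import hail as hl
#   mt = hl.balding_nichols_model(n_populations=3, n_samples=100, n_variants=1000)

pops, genotypes = simulate_unlinked_genotypes(n_populations=3, n_samples=100, n_variants=1000)
```

Because each variant is sampled on its own, scaling `n_samples` and `n_variants` gives arbitrarily large benchmark inputs, but results on LD-dependent methods (pruning, clumping) will not be representative.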
I think this is a great project, and we’re painfully aware that many operations in Hail are orders of magnitude slower than PLINK on a single core. There is a path to having roughly comparable performance, but I estimate it’ll take at least a year (probably longer) before we have a code generator that emits vectorized instructions.
Please let us know if there’s any way we can help this effort along!