DNARecords integration?

amanas · September 5, 2022, 6:59am

DNARecords has recently been released. It exports a Hail expression to a dataset of TFRecord or Parquet files, ready to use with machine learning frameworks like TensorFlow or PyTorch.

The whole thing is implemented on top of Hail, so it scales as far as Hail does. It also takes advantage of sparsity, and the final dataset can be generated variant-wise, sample-wise, or both at once.
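To make the sparsity and the two orientations concrete, here is a minimal pure-Python sketch. It is only an illustration of the idea and not the actual DNARecords schema: the toy `genotypes` matrix, the `to_sparse` and `transpose` helpers, and the (indices, values) record layout are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical record layout: (positions of non-zero entries, their values).
SparseRecord = Tuple[List[int], List[int]]


def to_sparse(dense: List[List[int]]) -> List[SparseRecord]:
    """Turn each row of a dense matrix into one sparse record."""
    records = []
    for row in dense:
        idx = [i for i, value in enumerate(row) if value != 0]
        records.append((idx, [row[i] for i in idx]))
    return records


def transpose(dense: List[List[int]]) -> List[List[int]]:
    """Swap rows and columns, so variants x samples becomes samples x variants."""
    return [list(col) for col in zip(*dense)]


# Toy data: 3 variants (rows) x 4 samples (columns), alternate-allele counts.
genotypes = [
    [0, 1, 0, 0],
    [0, 0, 0, 2],
    [1, 0, 0, 0],
]

# One record per variant: which samples carry it, and with what count.
variant_wise = to_sparse(genotypes)

# One record per sample: which variants it carries, and with what count.
sample_wise = to_sparse(transpose(genotypes))
```

Only the non-zero entries are stored in either orientation, which is where the space saving comes from when most genotypes are homozygous reference.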

For more details, you can review:

the docs: DNARecords — dnarecords documentation
the paper: DNARecords: An extensible sparse format for petabyte scale genomics analysis | bioRxiv
the code: GitHub - amanas/dnarecords: Genomics data ML ready

I wonder whether it could be useful to integrate it into Hail rather than having a different package for this kind of stuff.

Please let me know what you think. Hopefully, integrating this into Hail would be quite straightforward.

Regards,

Andrés.

danking · September 20, 2022, 8:13pm

Hey @amanas !

This is very cool! It’s great to see so much innovation in the storage of large sequencing datasets.

Unfortunately, we don’t have bandwidth to integrate this ourselves. We do welcome pull requests! There are a few issues we’d need to address to get to a pull request:

We don’t want required dependencies on new libraries, like tfrecords. I think that part is best kept as a separate Python package that users can install if they’re interested.
Hail’s import / export procedures are all functions with names like import_X and export_X. You’ll need to restructure your code into, for example, an export_dnarecords_parquet function.
We are moving away from Apache Spark and prefer all new functionality to include a non-Spark option. In this case, your data representation should be expressible in terms of Hail Tables (just like how you currently express it in terms of Spark Datasets). We also recently released Table.write_many which can be used to write many Tables in parallel. You should use this instead of a ThreadPool.
We can only accept a pull request that has short & fast tests that don’t depend on external libraries (e.g. tfrecords).
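As a rough sketch of what the `export_X` naming and the `Table.write_many` suggestion could look like together, something along these lines might be a starting point. The function name `export_entry_tables` and its parameters are hypothetical, it is not the DNARecords format or an existing Hail function, and it only assumes that `MatrixTable.entries()` and `Table.write_many(output, fields, overwrite=...)` behave as described in the Hail docs.

```python
from typing import Sequence


def export_entry_tables(mt, output: str, fields: Sequence[str], *,
                        overwrite: bool = False) -> None:
    """Hypothetical exporter following Hail's export_X naming convention.

    `mt` is expected to be a Hail MatrixTable. The function needs no new
    dependencies and no thread pool: it flattens the matrix into a Table
    of entries and lets Table.write_many write one Table per field.
    """
    # One row per (variant, sample) pair, keyed by row key and column key.
    entries = mt.entries()
    # Assumption: every name in `fields` is a field of the entries Table.
    entries.write_many(output, fields=list(fields), overwrite=overwrite)
```

With a real MatrixTable this might be called as `export_entry_tables(mt, 'gs://my-bucket/out', ['GT'])`, where the bucket path is a placeholder. Because the function only calls methods on the object it is given, a short test can pass in a small stand-in object instead of depending on external libraries.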

If you have bandwidth to do the above, we can collaborate. If not, I think a separate Python package is actually a pretty good option!
