Hi @hail-team !
DNARecords has recently been released. It exports a Hail expression as a dataset of TFRecord or Parquet files, ready to use with machine-learning frameworks like TensorFlow or PyTorch.
The whole thing is implemented on top of Hail, so it scales as well as Hail does. It also takes advantage of sparsity, and the final dataset can be generated variant-wise, sample-wise, or both at once.
For more details, you can review:
I wonder whether it would be useful to integrate this into Hail rather than keeping it as a separate package.
Please let me know what you think. Hopefully, integrating it into Hail would be fairly straightforward.
Hey @amanas !
This is very cool! It’s great to see so much innovation in the storage of large sequencing datasets.
Unfortunately, we don’t have the bandwidth to integrate this ourselves. We do welcome pull requests! There are a few issues we’d need to address to get to a pull request:
- We don’t want required dependencies on new libraries, like tfrecords. I think that part is best kept as a separate Python package that users can install if they’re interested.
- Hail’s import/export procedures are all functions with names like
export_X. You’ll need to restructure your code into, for example, an export function of that form.
- We are moving away from Apache Spark and prefer all new functionality to include a non-Spark option. In this case, your data representation should be expressible in terms of Hail Tables (just as you currently express it in terms of Spark Datasets). We also recently released
Table.write_many, which writes many Tables in parallel. You should use it instead of a ThreadPool.
- We can only accept a pull request that has short & fast tests that don’t depend on external libraries (e.g. tfrecords).
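To make the first two points concrete, here is a hypothetical sketch of what a restructured entry point might look like. The name `export_dnarecords`, its parameters, and the assumed Table layout are illustrative only, not existing Hail or DNARecords API; `Table.write_many` is the Hail method mentioned above, and its exact semantics should be checked against the Hail docs:

```python
def export_dnarecords(ht, output, *, fields, overwrite=False):
    """Hypothetical export_X-style entry point (a sketch, not Hail API).

    `ht` is assumed to be a Hail Table with one field per DNARecords
    output shard; `fields` lists which of those fields to write.
    """
    # Table.write_many writes one Table per listed field in a single
    # backend pass -- this is what replaces the ThreadPool approach.
    ht.write_many(output, fields=fields, overwrite=overwrite)
```

The point is the shape: a single `export_X` function whose parallelism comes from the Hail backend rather than client-side threads.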
If you have bandwidth to do the above, we can collaborate. If not, I think a separate Python package is actually a pretty good option!