GenomicsDB integration

First of all, thank you very much for your work in Hail. I’ve only started using it recently and I can already see how much potential it has!

I also recently started using GenomicsDB and saw that there’s an integration in Hail’s master branch. I tried testing it with your sample data (sample2loader.json, sample2callsets.json, etc.), and if I use the workspace you provide (tdbworkspace), everything works fine. However, if I create a workspace and load your sample data with my local GenomicsDB installation, it doesn’t work. The commands I use are:

./create_tiledb_workspace <workspace_path>

./vcf2tiledb <loader_path>

I have already tried running them with both versions 0.6.4 and 0.8.1. The only things I edit in your loader, vid, and callsets files are the paths, and the errors I get are:

  • In version 0.6.4:

java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at is.hail.utils.ArrayStack$mcI$sp.top$mcI$sp(ArrayStack.scala:23)
at is.hail.annotations.RegionValueBuilder.setMissing(RegionValueBuilder.scala:172)
at is.hail.io.vcf.HtsjdkRecordReader.readVariantInfo(HtsjdkRecordReader.scala:35)
at is.hail.io.vcf.HtsjdkRecordReader.readRecord(HtsjdkRecordReader.scala:67)
at is.hail.io.vcf.LoadGDB$$anonfun$3.apply(LoadGDB.scala:182)
at is.hail.io.vcf.LoadGDB$$anonfun$3.apply(LoadGDB.scala:178)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at is.hail.io.vcf.LoadGDB$.apply(LoadGDB.scala:184)
… 71 elided

  • In version 0.8.1:

terminate called after throwing an instance of 'std::length_error'
what(): basic_string::resize

When I replace my file tdbworkspace/sample2Array/__array_schema.tdb (generated with my local GenomicsDB installation) with your copy of the same file, the errors disappear. Am I missing something?
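
For anyone trying to narrow this down, one way to see how the two __array_schema.tdb files differ is a plain byte-level comparison. The following is a minimal sketch, not part of Hail or GenomicsDB. The function names and the paths in the usage comment are hypothetical, and the code assumes nothing about the schema's internal format:

```python
from pathlib import Path


def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One input is a prefix of the other: they differ where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def compare_schemas(local_path, reference_path, context=16):
    """Summarize where two schema files diverge, with a hex dump around that point."""
    local = Path(local_path).read_bytes()
    reference = Path(reference_path).read_bytes()
    offset = first_difference(local, reference)
    if offset is None:
        return "identical"
    start = max(0, offset - context)
    end = offset + context
    return (
        f"sizes {len(local)} vs {len(reference)}; "
        f"first difference at byte {offset}\n"
        f"local:     {local[start:end].hex()}\n"
        f"reference: {reference[start:end].hex()}"
    )


# Hypothetical usage (both paths are placeholders):
# print(compare_schemas("my_workspace/sample2Array/__array_schema.tdb",
#                       "tdbworkspace/sample2Array/__array_schema.tdb"))
```

A difference in size or in the first few bytes would be consistent with the two files having been written by different GenomicsDB versions, although this check alone cannot confirm that.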

Thanks,

Cristina.

Hi Cristina,
I would describe the state of GenomicsDB in Hail as an early prototype. There are two problems with supporting it in a more direct way:

  • GenomicsDB isn’t designed to work on files stored in a distributed file system like Google Storage or Amazon S3. Most Hail users are running on Google Cloud, so we want to make that experience nice.
  • GenomicsDB itself is in development and changing quickly.

As such, I don’t anticipate much development going into Hail GenomicsDB support in the next several months.

What are some of the reasons you’re using it? In particular, the incremental add of gVCFs is very useful.

Hi Tim,

Thanks, I’ll keep that in mind. Yes, the incremental add is precisely why I need it!

Cristina.