Is the Scala code in Nirvana.scala up to date or deprecated?

Hi, I am going to make an attempt to get Nirvana annotations to work with Hail (I work for Illumina, so I can't go around annotating with VEP, can I? :slight_smile:).

I was trying to get it to work (even though it will not work with the most recent Nirvana), and when I compared it to the way VEP was implemented I noticed some stark differences.

For one, it seems that the VEP configuration JSON file needs to be stored on HDFS, while the Nirvana config file seems to have to be stored on the local disk of the nodes (all nodes?).

I also ran into some permission-denied issues when trying to run a wrapper script (rather than running the dotnet command directly, since I need to stream the VCFs and Nirvana insists on a file). The script is stored locally, but I'm not sure how that fits with a Spark-on-YARN deployment, which usually does not access the local disk…

In any case, is the code in Nirvana.scala a good starting point to get Nirvana going again, or should I take what is done for VEP and attempt to make it work with Nirvana? What is your recommendation?

I think the Nirvana code isn't actively tested (and so has most likely rotted), but it's still a great place to start. We can fix Nirvana to read its config file from the network file system instead of the local file system; that's quite easy.

I got it to work to some extent (although it does not seem to parse the JSON output file :slight_smile:), but what changes would I need to make so it reads the files from HDFS? I think VEP is doing it, so I could look at that code… but any hints?

I am also wondering if I need to put all the cache files etc. on HDFS, but I'm not sure Nirvana "knows" how to read files from HDFS, so I'm guessing those should stay on the local file system? The wrapper script I wrote seems to only work if I put it on HDFS, but that is probably because Scala/Java is starting it; once it's running, Nirvana does not know about HDFS, I guess? Still wrapping my head around how Spark and non-Spark apps work together…

I’m skeptical about putting all the cache files on HDFS. Nirvana would likely have to be modified for that to work.

As for reading the config file from HDFS, the thing to look at is that Nirvana reads its config using a plain Java FileInputStream, which only understands the local file system:

    val properties = try {
      val p = new Properties()
      val is = new FileInputStream(config)
      p.load(is)
      is.close()
      p
    } catch {
      case e: IOException =>
        fatal(s"could not open file: ${ e.getMessage }")
    }

Whereas VEP uses fs.open, which goes through our filesystem interface and understands HDFS:

  def readConfiguration(fs: FS, path: String): VEPConfiguration = {
    val jv = using(fs.open(path)) { in =>
      JsonMethods.parse(in)
    }
    implicit val formats: Formats = defaultJSONFormats + new TStructSerializer
    jv.extract[VEPConfiguration]
  }
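
So a minimal sketch of the fix, assuming you keep Hail's `using` helper, `FS` interface, and `fatal` exactly as they appear in the two snippets above (the surrounding names here are illustrative, not the final API), would be to load the properties through `fs.open` instead of `FileInputStream`:

    import java.io.IOException
    import java.util.Properties

    // Sketch: load the Nirvana properties file through Hail's FS interface,
    // so the config path can live on HDFS (or any filesystem FS supports)
    // instead of on every worker's local disk.
    def readProperties(fs: FS, config: String): Properties =
      try {
        using(fs.open(config)) { is =>
          val p = new Properties()
          p.load(is)
          p
        }
      } catch {
        case e: IOException =>
          fatal(s"could not open file: ${ e.getMessage }")
      }

Using `using` also fixes a small leak in the original snippet: the stream is closed even if `p.load` throws.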

Yeah, putting the cache files on Hadoop is not going to work (easily). Thanks for the pointer on how to get it to read the config file from Hadoop… That will make it at least a little easier, although since I already have to bootstrap all the nodes to have the cache files, getting the small config file there as well is trivial. On Hadoop, though, I can easily modify it for all nodes at once…
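
If keeping the wrapper script itself on HDFS turns out to be the easier route, I imagine something like this on each worker would do it (purely a sketch; `localizeScript` is my own name, not a Hail API, and `fs`/`using` are the same helpers as above). It would also take care of my permission-denied errors, since the localized copy is explicitly marked executable:

    import java.io.{File, FileOutputStream}

    // Sketch: copy the wrapper script out of HDFS onto the worker's local
    // disk and mark it executable, so Nirvana can be exec'd with a plain
    // local path.
    def localizeScript(fs: FS, hdfsPath: String): String = {
      val local = File.createTempFile("nirvana-wrapper", ".sh")
      local.deleteOnExit()
      using(fs.open(hdfsPath)) { in =>
        using(new FileOutputStream(local)) { out =>
          val buf = new Array[Byte](64 * 1024)
          var n = in.read(buf)
          while (n != -1) {
            out.write(buf, 0, n)
            n = in.read(buf)
          }
        }
      }
      local.setExecutable(true) // exec fails with "permission denied" otherwise
      local.getAbsolutePath
    }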