New sparkinfo command and logging of number of partitions and persistence level

Under the Spark model, variant datasets are internally broken up into chunks called partitions which can be persisted on disk or cached in memory; see the Spark programming guide for details.

Since the number of partitions and persistence level can significantly affect performance, we’ve added a sparkinfo command to print this information to the screen. We’ve also added automatic logging of this information just before any command that requires a variant dataset in state. For example, after running

hail \
  importvcf -i src/test/resources/sample.vcf \
  repartition -n 4 \
  splitmulti \
  cache \
  pca -k 2 -s 'sa.pc' \
  write -o sample.vds

running grep 'sparkinfo' hail.log on the command line returns

2016-11-02 14:28:35 INFO  Hail:256 - sparkinfo: repartition, 1 partitions, NONE
2016-11-02 14:28:35 INFO  Hail:256 - sparkinfo: splitmulti, 4 partitions, NONE
2016-11-02 14:28:35 INFO  Hail:256 - sparkinfo: cache, 4 partitions, NONE
2016-11-02 14:28:35 INFO  Hail:256 - sparkinfo: pca, 4 partitions, MEMORY_ONLY
2016-11-02 14:28:36 INFO  Hail:256 - sparkinfo: write, 4 partitions, MEMORY_ONLY

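Each entry reports the number of partitions and the persistence level just before the named command runs, which is why repartition is logged with 1 partition and cache with NONE. To control these values, use the repartition, cache, and persist commands.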
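If you want to track these values across a pipeline rather than read them by eye, the grep output can be parsed into records. The following is a minimal sketch in Python, not part of Hail: the regular expression is inferred only from the sample log lines shown above and may need adjusting for other versions, and the names `SparkInfo` and `parse_sparkinfo` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the sample log lines in this note (an assumption,
# not a documented format): "sparkinfo: <command>, <n> partitions, <level>".
SPARKINFO_RE = re.compile(
    r"sparkinfo: (?P<command>\S+), (?P<partitions>\d+) partitions, (?P<level>\S+)"
)


class SparkInfo(NamedTuple):
    """One sparkinfo log entry (hypothetical helper type)."""
    command: str      # command about to run
    partitions: int   # number of partitions just before it runs
    level: str        # persistence level just before it runs


def parse_sparkinfo(line: str) -> Optional[SparkInfo]:
    """Return the parsed entry, or None if the line is not a sparkinfo line."""
    m = SPARKINFO_RE.search(line)
    if m is None:
        return None
    return SparkInfo(m["command"], int(m["partitions"]), m["level"])


# The log excerpt from this note, used here as sample input.
sample = """\
2016-11-02 14:28:35 INFO  Hail:256 - sparkinfo: repartition, 1 partitions, NONE
2016-11-02 14:28:35 INFO  Hail:256 - sparkinfo: splitmulti, 4 partitions, NONE
2016-11-02 14:28:35 INFO  Hail:256 - sparkinfo: cache, 4 partitions, NONE
2016-11-02 14:28:35 INFO  Hail:256 - sparkinfo: pca, 4 partitions, MEMORY_ONLY
2016-11-02 14:28:36 INFO  Hail:256 - sparkinfo: write, 4 partitions, MEMORY_ONLY
"""

records = [r for r in map(parse_sparkinfo, sample.splitlines()) if r is not None]

# Commands that ran against an unpersisted dataset.
unpersisted = [r.command for r in records if r.level == "NONE"]
```

In practice you would read the lines from hail.log instead of the inline sample. Lines that do not match the pattern are skipped rather than raising, so the same function can be run over the whole log without filtering it through grep first.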