New sparkinfo command and logging of number of partitions and persistence level



Under the Spark model, variant datasets are internally broken up into chunks called partitions, which can be persisted on disk or cached in memory; see the Spark programming guide for details.

Since the number of partitions and the persistence level can significantly affect performance, we’ve added a sparkinfo command that prints this information to the screen. Hail now also logs this information automatically just before running any command that requires a variant dataset in state. For example, after running

hail \
  importvcf -i src/test/resources/sample.vcf \
  repartition -n 4 \
  splitmulti \
  cache \
  pca -k 2 -s 'sa.pc' \
  write -o sample.vds

running grep 'sparkinfo' hail.log on the command line returns

2016-11-02 14:28:35 INFO  Hail:256 - sparkinfo: repartition, 1 partitions, NONE
2016-11-02 14:28:35 INFO  Hail:256 - sparkinfo: splitmulti, 4 partitions, NONE
2016-11-02 14:28:35 INFO  Hail:256 - sparkinfo: cache, 4 partitions, NONE
2016-11-02 14:28:35 INFO  Hail:256 - sparkinfo: pca, 4 partitions, MEMORY_ONLY
2016-11-02 14:28:36 INFO  Hail:256 - sparkinfo: write, 4 partitions, MEMORY_ONLY

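Each entry is logged just before its command runs, so it reflects the state left by the preceding commands: repartition still sees 1 partition, and the MEMORY_ONLY level set by cache first appears in the pca entry.

To control these values, use the repartition, cache, and persist commands.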
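
For longer pipelines, the sparkinfo log entries can be condensed into a small tab-separated table of command, partition count, and persistence level. The following is a minimal sketch, assuming the log line format shown above; the function name sparkinfo_summary is hypothetical and is not part of Hail.

```shell
# Hypothetical helper (not part of Hail): condense sparkinfo log entries.
# Assumes each entry ends with:
#   sparkinfo: <command>, <N> partitions, <LEVEL>
sparkinfo_summary() {
  # $1: path to the log file (defaults to hail.log in the current directory)
  awk -F'sparkinfo: ' '
    /sparkinfo: / {
      # $2 holds everything after the marker, e.g. "pca, 4 partitions, MEMORY_ONLY"
      split($2, f, ", ")
      # Keep only the number from "<N> partitions"
      sub(/ partitions$/, "", f[2])
      printf "%s\t%s\t%s\n", f[1], f[2], f[3]
    }
  ' "${1:-hail.log}"
}
```

Calling sparkinfo_summary hail.log after the run above would print one line per logged command, which makes it easier to spot where the partition count or persistence level changes.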


Using Hail on the Google Cloud Platform