Under the Spark model, variant datasets are internally broken up into chunks called partitions, which can be persisted on disk or cached in memory; see the Spark programming guide for details.
Since the number of partitions and the persistence level can significantly affect performance, we’ve added a sparkinfo command that prints both to the screen. The same information is also logged automatically just before each command that requires a variant dataset in state. For example, after running
hail \
importvcf -i src/test/resources/sample.vcf \
repartition -n 4 \
splitmulti \
cache \
pca -k 2 -s 'sa.pc' \
write -o sample.vds
executing grep 'sparkinfo' hail.log on the command line returns
2016-11-02 14:28:35 INFO Hail:256 - sparkinfo: repartition, 1 partitions, NONE
2016-11-02 14:28:35 INFO Hail:256 - sparkinfo: splitmulti, 4 partitions, NONE
2016-11-02 14:28:35 INFO Hail:256 - sparkinfo: cache, 4 partitions, NONE
2016-11-02 14:28:35 INFO Hail:256 - sparkinfo: pca, 4 partitions, MEMORY_ONLY
2016-11-02 14:28:36 INFO Hail:256 - sparkinfo: write, 4 partitions, MEMORY_ONLY
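Because each line is logged just before the named command runs, the effect of a command shows up on the following line: the dataset has 4 partitions only after repartition, and its persistence level is MEMORY_ONLY only after cache.
To control these values, use the repartition, cache, and persist commands.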
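For longer pipelines it can be easier to read the sparkinfo lines programmatically than by eye. The following is a minimal sketch in Python, not part of Hail: the function name parse_sparkinfo and the regular expression are assumptions based only on the log format shown above, and the sample lines are copied from that output.

```python
import re

# Assumed line format, inferred from the grep output shown above:
#   <date> <time> INFO Hail:<n> - sparkinfo: <command>, <N> partitions, <level>
SPARKINFO = re.compile(
    r"sparkinfo: (?P<command>\w+), (?P<partitions>\d+) partitions, (?P<level>\w+)"
)


def parse_sparkinfo(lines):
    """Return a (command, partition count, storage level) tuple per sparkinfo line.

    Hypothetical helper: lines that do not match the assumed format are skipped.
    """
    records = []
    for line in lines:
        match = SPARKINFO.search(line)
        if match:
            records.append(
                (
                    match.group("command"),
                    int(match.group("partitions")),
                    match.group("level"),
                )
            )
    return records


# Sample lines copied from the example output above.
sample_log = [
    "2016-11-02 14:28:35 INFO Hail:256 - sparkinfo: repartition, 1 partitions, NONE",
    "2016-11-02 14:28:35 INFO Hail:256 - sparkinfo: splitmulti, 4 partitions, NONE",
    "2016-11-02 14:28:35 INFO Hail:256 - sparkinfo: cache, 4 partitions, NONE",
    "2016-11-02 14:28:35 INFO Hail:256 - sparkinfo: pca, 4 partitions, MEMORY_ONLY",
    "2016-11-02 14:28:36 INFO Hail:256 - sparkinfo: write, 4 partitions, MEMORY_ONLY",
]

# Each record describes the dataset just before the named command runs.
for command, partitions, level in parse_sparkinfo(sample_log):
    print(f"before {command}: {partitions} partitions, {level}")
```

To apply the same helper to a real log, pass it an open file handle for hail.log in place of sample_log.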