New sparkinfo command and logging of number of partitions and persistence level

jbloom · November 2, 2016, 6:22pm

Under the Spark model, variant datasets are internally broken up into chunks called partitions, which can be persisted on disk or cached in memory; see the Spark programming guide for details.

Since the number of partitions and persistence level can significantly affect performance, we’ve added a sparkinfo command to print this information to the screen. We’ve also added automatic logging of this information just before any command that requires a variant dataset in state. For example, after running

hail \
  importvcf -i src/test/resources/sample.vcf \
  repartition -n 4 \
  splitmulti \
  cache \
  pca -k 2 -s 'sa.pc' \
  write -o sample.vds

running grep 'sparkinfo' hail.log on the command line returns

2016-11-02 14:28:35 INFO  Hail:256 - sparkinfo: repartition, 1 partitions, NONE
2016-11-02 14:28:35 INFO  Hail:256 - sparkinfo: splitmulti, 4 partitions, NONE
2016-11-02 14:28:35 INFO  Hail:256 - sparkinfo: cache, 4 partitions, NONE
2016-11-02 14:28:35 INFO  Hail:256 - sparkinfo: pca, 4 partitions, MEMORY_ONLY
2016-11-02 14:28:36 INFO  Hail:256 - sparkinfo: write, 4 partitions, MEMORY_ONLY
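
To compare steps at a glance, the log lines above can be reduced to just the command, partition count, and persistence level. The following is a minimal sketch, assuming the log keeps the layout shown above (a `sparkinfo: ` marker followed by `command, N partitions, LEVEL`); the function name `sparkinfo_summary` is hypothetical and not part of Hail.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hail): summarize sparkinfo log lines.
# Assumes each relevant line looks like
#   <timestamp> INFO  Hail:256 - sparkinfo: <command>, <N> partitions, <LEVEL>
sparkinfo_summary() {
  # Split each line on the "sparkinfo: " marker; lines without it are skipped.
  awk -F'sparkinfo: ' '/sparkinfo: / {
    split($2, f, ", ")              # f[1]=command, f[2]="N partitions", f[3]=level
    sub(/ partitions$/, "", f[2])   # keep only the partition count
    print f[1] "\t" f[2] "\t" f[3]
  }' "$1"
}

# Usage: print a tab-separated summary if a log file is present.
if [ -f hail.log ]; then
  sparkinfo_summary hail.log
fi
```

For the first log line above, this would print `repartition`, `1`, and `NONE` separated by tabs.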

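Each line reports the state just before the named command runs, so the effect of repartition first appears on the splitmulti line and the effect of cache on the pca line. To control these values, use the repartition, cache, and persist commands.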
