Hello,
I have been running a Python Hail script on chromosome 20 that does sample QC, variant QC, and some basic filtering steps on Google Cloud Dataproc using 64 workers. It exports tables and writes out MatrixTables at various checkpoints. The script completed in around 5 hours.
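For context, the script is structurally similar to the minimal sketch below (the paths and thresholds are placeholders, not the values actually used):

```python
import hail as hl

hl.init()

# Placeholder input path; the real script reads the chromosome 20 dataset.
mt = hl.read_matrix_table('gs://my-bucket/chr20.mt')

# Sample and variant QC annotations.
mt = hl.sample_qc(mt)
mt = hl.variant_qc(mt)

# Basic filtering; the real thresholds differ.
mt = mt.filter_cols(mt.sample_qc.call_rate > 0.97)
mt = mt.filter_rows(mt.variant_qc.call_rate > 0.99)

# Write out an intermediate MatrixTable as a checkpoint, then read it back.
mt.write('gs://my-bucket/chr20.qc.mt', overwrite=True)
mt = hl.read_matrix_table('gs://my-bucket/chr20.qc.mt')

# Export a summary table.
mt.cols().export('gs://my-bucket/chr20.sample_qc.tsv.bgz')
```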
When I doubled the number of workers to 128, the process did not speed up and still took about 5 hours. Why do you think doubling the number of workers did not improve the running time on the same dataset?
What parameters would help speed up the process, especially as we are now scaling to chromosome 1 and eventually the whole genome?
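One thing I wondered about is whether the number of partitions in the MatrixTable, rather than the number of workers, is the limiting factor, since (as I understand it) each partition becomes one Spark task. This is how I have been checking it (the path is a placeholder):

```python
import hail as hl

hl.init()

# Placeholder path standing in for the real checkpointed MatrixTable.
mt = hl.read_matrix_table('gs://my-bucket/chr20.mt')

# Partitions bound the number of tasks that can run concurrently,
# so fewer partitions than cluster cores leaves workers idle.
print(mt.n_partitions())
```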
This is my current configuration:
```
--image-version=1.4-debian9
--properties=spark:spark.driver.maxResultSize=0,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,hdfs:dfs.replication=1,dataproc:dataproc.logging.stackdriver.enable=false,dataproc:dataproc.monitoring.stackdriver.enable=false,spark:spark.driver.memory=41g
--initialization-actions=gs://hail-common/hailctl/dataproc/0.2.20/init_notebook.py
--metadata=^|||^WHEEL=gs://hail-common/hailctl/dataproc/0.2.20/hail-0.2.20-py3-none-any.whl|||PKGS=aiohttp|bokeh>1.1,<1.3|decorator<5|gcsfs==0.2.1|hurry.filesize==0.9|ipykernel<5|nest_asyncio|numpy<2|pandas>0.22,<0.24|parsimonious<0.9|PyJWT|python-json-logger==0.1.11|requests>=2.21.0,<2.21.1|scipy>1.2,<1.4|tabulate==0.8.3|PyYAML
--master-machine-type=n1-highmem-8
--master-boot-disk-size=100GB
--num-master-local-ssds=0
--num-preemptible-workers=0
--num-worker-local-ssds=0
--num-workers=64
--preemptible-worker-boot-disk-size=40GB
--worker-boot-disk-size=40
--worker-machine-type=n1-standard-8
--zone=europe-west2-a
--initialization-action-timeout=20m
--labels=creator=pa10_sanger_ac_uk
```
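For what it's worth, this is the rough check I run inside the Hail session to see how many cores Spark actually exposes (not from the Hail docs specifically, just PySpark's default parallelism):

```python
import hail as hl

hl.init()

# Hail's underlying SparkContext; defaultParallelism should roughly equal
# 8 cores per n1-standard-8 worker times the number of workers.
sc = hl.spark_context()
print(sc.defaultParallelism)
```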