Hello,
I have been running a Python Hail script on chromosome 20 that does sample QC, variant QC, and some basic filtering steps on Google Cloud, using 64 workers. It exports tables and writes out MatrixTables at various checkpoints. The script completed in around 5 hours.
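For context, the script follows roughly this shape (a minimal sketch with hypothetical paths and thresholds, not the exact code):

```python
import hail as hl

hl.init()

# Hypothetical path to the chromosome 20 input MatrixTable.
mt = hl.read_matrix_table('gs://my-bucket/chr20.mt')

# Per-sample and per-variant QC metrics.
mt = hl.sample_qc(mt)
mt = hl.variant_qc(mt)

# One of the basic filtering steps (threshold is illustrative).
mt = mt.filter_rows(mt.variant_qc.call_rate > 0.97)

# Checkpoint: write the intermediate result and read it back.
mt = mt.checkpoint('gs://my-bucket/chr20_qc.mt', overwrite=True)

# Export a table of the sample QC metrics.
mt.cols().export('gs://my-bucket/chr20_sample_qc.tsv')
```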
When I doubled the number of workers to 128, the process did not speed up: it still took around 5 hours on the same dataset. Why do you think doubling the number of workers did not improve the running time?
What parameters would help speed up the process, especially as we are now scaling to chromosome 1 and, eventually, the whole genome?
This is my current configuration:
```
--image-version=1.4-debian9
--properties=spark:spark.driver.maxResultSize=0,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,spark:spark.driver.extraJavaOptions=-Xss4M,spark:spark.executor.extraJavaOptions=-Xss4M,hdfs:dfs.replication=1,dataproc:dataproc.logging.stackdriver.enable=false,dataproc:dataproc.monitoring.stackdriver.enable=false,spark:spark.driver.memory=41g
--initialization-actions=gs://hail-common/hailctl/dataproc/0.2.20/init_notebook.py
--metadata=^|||^WHEEL=gs://hail-common/hailctl/dataproc/0.2.20/hail-0.2.20-py3-none-any.whl|||PKGS=aiohttp|bokeh>1.1,<1.3|decorator<5|gcsfs==0.2.1|hurry.filesize==0.9|ipykernel<5|nest_asyncio|numpy<2|pandas>0.22,<0.24|parsimonious<0.9|PyJWT|python-json-logger==0.1.11|requests>=2.21.0,<2.21.1|scipy>1.2,<1.4|tabulate==0.8.3|PyYAML
--master-machine-type=n1-highmem-8
--master-boot-disk-size=100GB
--num-master-local-ssds=0
--num-preemptible-workers=0
--num-worker-local-ssds=0
--num-workers=64
--preemptible-worker-boot-disk-size=40GB
--worker-boot-disk-size=40
--worker-machine-type=n1-standard-8
--zone=europe-west2-a
--initialization-action-timeout=20m
--labels=creator=pa10_sanger_ac_uk
```
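In case partitioning is relevant, this is how I would check the partition count of the input (a minimal sketch with a hypothetical path):

```python
import hail as hl

hl.init()

# Hypothetical path to the chromosome 20 input used by the script.
mt = hl.read_matrix_table('gs://my-bucket/chr20.mt')

# One partition is processed by one Spark task at a time, so this number
# bounds how many of the 128 x 8 = 1024 worker cores can be busy at once.
print(mt.n_partitions())
```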