Hello all! I am interested in using `hl.vds.new_combiner` to joint-call around 15k samples. As this is a relatively large dataset, I would like to seek clarification on two points:
- To restart after a failed execution, is it correct that I can simply define the `hl.vds.new_combiner` again and run it with the same arguments to resume from the checkpoint?
- The maximum job time at my institution is 14 days. When a job stops (i.e. is manually force-stopped), can joint calling be resumed in a new job from the `save_path` by running the same function again?
Really appreciate your insights!
Both of these are correct; the combiner's state is saved between executions. If you run the same script with the same version of Hail and don't pass `force=True` to `new_combiner`, it will reuse the same plan and pick up where it left off.
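For reference, a minimal sketch of that resume pattern. All paths and the gVCF list here are placeholders, and the exact defaults (e.g. where the plan is saved when `save_path` is omitted) are described in the Hail docs:

```python
import hail as hl

hl.init(tmp_dir='./hail_tmp')

# Placeholder paths -- substitute your own. Re-running this exact script
# after a failure reuses the saved combiner plan and resumes.
combiner = hl.vds.new_combiner(
    output_path='out/dataset.vds',
    temp_path='./hail_tmp',
    gvcf_paths=['gvcfs/sample1.g.vcf.gz', 'gvcfs/sample2.g.vcf.gz'],
    save_path='out/combiner_plan.json',  # plan checkpoint; use the same path on restart
    use_exome_default_intervals=True,
    reference_genome='GRCh38',
)
combiner.run()  # picks up from the last checkpoint if the saved plan exists
```

Keeping `save_path` fixed across jobs is what lets a fresh 14-day job continue where the previous one stopped.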
Thank you very much for your response @chrisvittal! I would like to seek your advice on another matter. While running it on a small subset of 1000 samples, I already ran into this issue:
```
Error summary: FileNotFoundException: ./hail_tmp/combiner-intermediates/e431b2db-2725-44a6-a869-335c29e76d53_gvcf-combine_job1/dataset_0.vds/reference_data/index/part-26-0-26-0-0ce32f55-820d-7ea9-42c9-9b24d305c8ef.idx/metadata.json.gz (Too many open files)
```
I am inferring that this is due to having too many partitions (~8000 partitions at the stage where it failed). I am using WES samples, so I set `use_exome_default_intervals=True`. To rectify the above issue, would you recommend increasing the interval size with `import_interval_size`? Thank you very much!
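For context, this is a hedged sketch of what passing an explicit interval size instead of the exome defaults might look like. The interval size below is purely illustrative, not a recommendation; larger import intervals generally mean fewer, larger partitions (and fewer files open at once):

```python
import hail as hl

hl.init(tmp_dir='./hail_tmp')

# Assumption: swapping use_exome_default_intervals=True for an explicit
# import_interval_size to reduce the partition count. All paths are placeholders.
gvcf_paths = ['gvcfs/sample1.g.vcf.gz', 'gvcfs/sample2.g.vcf.gz']  # your 1000 gVCFs

combiner = hl.vds.new_combiner(
    output_path='out/dataset.vds',
    temp_path='./hail_tmp',
    gvcf_paths=gvcf_paths,
    save_path='out/combiner_plan.json',
    import_interval_size=50_000_000,  # illustrative value only; tune for your data
    reference_genome='GRCh38',
)
combiner.run()
```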