Merge multiple sparse MTs into one sparse MT

Hello,

In case you did not see my question in the reply to my other post, I am asking it again here.

I updated Hail to version 0.2.56 and tested the run_combiner() function with 100, 1k, and 10k gvcf files (average size: 6 GB each). The run_combiner() runs for 100 gvcfs and 1k gvcfs completed successfully, after multiple retries of failed subtasks that showed messages similar to those I reported before. The run time with the new version was faster than with the previous Hail version. Thanks very much for your and your team’s work.

The job for 10k gvcfs failed, however. The first round of 100 batches, each merging 100 gvcfs into a sparse MT, completed successfully, but the job failed when starting the second round with the error message below. If you can let me know how to resolve this issue, I would really appreciate it.

– Caused by: java.io.IOException: All datanodes [DatanodeInfoWithStorage[*******,DISK]] are bad

Also, I found the 100 sparse MTs generated by the first round of the run_combiner() run in my temp storage.
Is it possible to combine those 100 MTs into 1 MT with another Hail function?
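
Just to illustrate what I found, each of these intermediates opens as an ordinary matrix table (the paths below are placeholders; the actual file names in my temp bucket are different):

import hail as hl

# Placeholder paths standing in for the 100 first-round intermediates in my temp bucket.
intermediate_paths = [
    'gs://[my_bucket]/[temp_folder]/combiner-intermediate-0.mt',
    'gs://[my_bucket]/[temp_folder]/combiner-intermediate-1.mt',
    # ... and 98 more
]

# Each one reads back as a regular sparse matrix table with about 100 samples.
mt = hl.read_matrix_table(intermediate_paths[0])
print(mt.count_cols())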

Thank you.
-Jina

Thanks very much for your patience on this. Could you share your call to the combiner? What temporary directory are you using? The “data nodes are bad” error might mean that the nodes are full; I think that’s the exception HDFS throws when it runs out of storage space.
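
If it helps, you could sanity-check this directly on the cluster’s master node with standard HDFS tooling (nothing Hail-specific; these are ordinary Hadoop commands):

hdfs dfsadmin -report
hdfs dfs -df -h /

If the datanodes’ remaining capacity drops to near zero while the job is running, that would point to the storage explanation.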

Hi Tim @tpoterba,

Thanks for your reply. I have attached the Python script that calls the run_combiner function, along with my commands to start a cluster and submit the job. I hope this helps with solving my issue.

By the way, regarding the “out of storage space” you mentioned, where is that storage located and how can I increase it?

-Jina

< Cluster on GCP >

  • my-auto-policy [myauto]: max primary workers: 10, max secondary workers: 1000

hailctl dataproc start [mycluster] --vep GRCh38 --labels=mt=hm8-10k --autoscaling-policy=[myauto] --master-machine-type=n1-highmem-8 --worker-machine-type=n1-highmem-8 --properties=dataproc:dataproc.logging.stackdriver.enable=true,dataproc:dataproc.monitoring.stackdriver.enable=true

< Job >

hailctl dataproc submit [mycluster] run_gvcf_combiner.py

< Function call in run_gvcf_combiner.py >

output_file = 'gs://[my_bucket]/[MT_folder]/10k_20200817.mt'  # output destination
temp_bucket = 'gs://[my_bucket]/[temp_folder]/'  # bucket for storing intermediate files

hl.experimental.run_combiner(inputs, out_file=output_file, tmp_path=temp_bucket, branch_factor=100, batch_size=100, reference_genome='GRCh38', use_genome_default_intervals=True)
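
For what it’s worth, my understanding of these settings (please correct me if I am wrong): with branch_factor=100, the 10k inputs should be reduced in two rounds, so the first round writes 100 intermediate sparse MTs and the second round merges those 100 into the final MT, which matches what I see in my temp bucket. A quick back-of-the-envelope check:

import math

n_inputs = 10_000
branch_factor = 100

# Round 1: 10,000 gvcfs merged 100 at a time -> 100 intermediate sparse MTs.
n_after_round_1 = math.ceil(n_inputs / branch_factor)
# Round 2: those 100 intermediates merged into the single output MT.
n_after_round_2 = math.ceil(n_after_round_1 / branch_factor)

print(n_after_round_1, n_after_round_2)  # 100 1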

Is that the full autoscaling policy? What’s the workerConfig minInstances?

You can see my autoscaling policy below.

workerConfig:
  maxInstances: 10
  minInstances: 2
  weight: 1
secondaryWorkerConfig:
  maxInstances: 1000
  weight: 1
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 120s

Hi @tpoterba and @chrisvittal,

If you can let me know how to solve the “out of storage space” error and/or how to merge multiple sparse matrix tables into one sparse matrix table, I would really appreciate it.

I am looking forward to your reply.

-Jina