Stage contains a task of very large size

We have a very small VCF (less than 5 MB after VEP annotation). We tried to write it to an MT file after annotating it with some reference datasets, but it took forever to finish: it ran for more than 6 hours, which doesn't sound right. We have other VCFs, some roughly 100 times bigger than this one, and we finished annotating and writing them to MT files without any problem.
Our workflow step is very similar to the one below.
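Roughly, the step looks like this (the paths, VEP config, and reference table below are placeholders, not our real pipeline code):

import hail as hl

# placeholders: the real paths, VEP config, and reference tables are not shown here
mt = hl.import_vcf("path/to/sample.vcf.bgz", reference_genome="GRCh38", min_partitions=500)
mt = hl.vep(mt, "path/to/vep-config.json")

# annotate rows with a reference table keyed by locus/alleles
ref_ht = hl.read_table("path/to/reference_annotations.ht")
mt = mt.annotate_rows(ref=ref_ht[mt.row_key])

mt.write("path/to/annotated.mt", overwrite=True)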

I reviewed the Hail log, and the only suspicious lines are listed below. I am wondering what might cause a task to be too large? Thank you! We are still using Hail 0.2.57. I interrupted the job this time.

2022-04-01 03:29:38 DAGScheduler: INFO: Submitting 12 missing tasks from ResultStage 6 (MapPartitionsRDD[256] at mapPartitions at ContextRDD.scala:160) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11))
2022-04-01 03:29:38 YarnScheduler: INFO: Adding task set 6.0 with 12 tasks
2022-04-01 03:29:38 TaskSetManager: WARN: Stage 6 contains a task of very large size (5223 KB). The maximum recommended task size is 100 KB.

hail-20220401-0324-0.2.57-582b2e31b8bd.txt (4.1 MB)

Hey @SimonLi5601 !

It’s a bit hard to comment without the exact code that y’all are executing. Is it possible to share that?

Also, how many variants did you start with? How many samples did you start with? Generally, things are quite a bit slower when starting with a VCF as opposed to a Hail MatrixTable. Seeing the exact code you ran will help us nail down the source of slowness.
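If you don't know those counts off-hand, a quick way to check them (rows are variants, columns are samples) is something like:

import hail as hl

mt = hl.import_vcf("path/to/sample.vcf.bgz")  # placeholder path
n_variants, n_samples = mt.count()            # returns (row count, column count)
print(n_variants, n_samples)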

I think the problem here is a lack of parallelism within your VCF. Joining a small left-side dataset against very large right-side datasets leads to very inefficient execution in Hail right now. You can fix this by importing your VCF with many more partitions; the default (file size divided by ~32 MB) only gave you 12.

mt = hl.import_vcf(path, min_partitions=1000)
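After importing, you can sanity-check the result before running the joins:

print(mt.n_partitions())  # confirm you got well more than the 12 partitions you had before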

@danking @tpoterba Thanks for your reply. When we import_vcf, we did set min_partitions to 500, but somehow after VEP annotation it is reset to 12. I used mt.repartition to go back up to 1000 partitions, and it didn't help much; the stage still slows down significantly once roughly 30% of its tasks are done. It could be a combined issue of the data and the software versions, so we are upgrading Hail (and Hadoop, Spark, and Elasticsearch accordingly).
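Concretely, the partition handling looks roughly like this (placeholder paths; the counts are what we observed in this run):

import hail as hl

mt = hl.import_vcf("path/to/sample.vcf.bgz", min_partitions=500)  # placeholder path
mt = hl.vep(mt, "path/to/vep-config.json")                        # placeholder VEP config
print(mt.n_partitions())   # reports 12 after VEP in our run, despite min_partitions=500
mt = mt.repartition(1000)  # shuffles back up to 1000 partitions, but the slow stage remains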