VDS HDFS reshuffle question


We noticed some strange behavior when loading a VDS (~7400 1 GB partitions) into Hail: YARN always allocates many executors to one particular worker node before allocating one or no executors to each of the remaining nodes. I’m not sure whether this is a data locality problem. Is there a remedy you can suggest, such as reshuffling partitions or changing the Parquet block size?


I think YARN doesn’t know about data locality at the time it assigns executors to worker nodes. I bet there’s a scheduler setting for whether to spread containers widely across nodes or pack them narrowly onto a few.
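In case it helps the conversation with IT, here is a rough sketch of the kind of settings I have in mind. It assumes the cluster runs either the YARN Fair Scheduler or the Capacity Scheduler (I don't know which yours uses), and the values are only illustrative, not recommendations. The property names come from the Hadoop scheduler documentation for recent versions; the helper `render_properties` is a made-up name that just prints them in `yarn-site.xml` form.

```python
# Candidate YARN properties that limit how many containers a node can be
# handed in a single heartbeat. Values are illustrative assumptions only.
FAIR_SCHEDULER = {
    # Fair Scheduler: allow only one container assignment per heartbeat.
    "yarn.scheduler.fair.assignmultiple": "false",
}

CAPACITY_SCHEDULER = {
    # Capacity Scheduler: keep multiple assignments on, but cap them at 1.
    "yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled": "true",
    "yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments": "1",
}


def render_properties(settings):
    """Render a dict of settings as Hadoop-style <property> elements."""
    lines = ["<configuration>"]
    for name, value in settings.items():
        lines.append("  <property>")
        lines.append(f"    <name>{name}</name>")
        lines.append(f"    <value>{value}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_properties(CAPACITY_SCHEDULER))
```

Whether any of these actually changes the placement you are seeing depends on the scheduler and Hadoop version in use, so treat it as a starting point for the discussion, not a fix.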

Thanks, I will ask our IT team to check whether there’s such a setting.