Preventing Out-Of-Memory Errors in Dask through Automated Memory-Aware Chunking

  • Daniel De Lucca Fonseca (UNICAMP)
  • Carlos Alberto Astudillo Trujillo (UNICAMP)
  • Edson Borin (UNICAMP)

Abstract


Data-parallel frameworks like Dask partition datasets into chunks for concurrent execution, but choosing suitable chunk dimensions remains challenging: oversized chunks cause Out-Of-Memory (OOM) failures, while undersized chunks reduce performance. This paper introduces memory-aware chunking, which predicts peak memory from input shapes using linear regression and automatically derives optimal chunk sizes adapted to each operator’s memory requirements. An evaluation on seismic imaging operators across 768 trials shows complete elimination of OOM failures (versus a 31.6% failure rate with Dask’s default chunking) and a 52% reduction in peak memory, enabling reliable distributed processing in memory-constrained environments.
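As a rough illustration of the idea described above, the sketch below (in Python, Dask's host language) fits a linear model of peak memory versus chunk size from a few profiling runs and then solves for the largest chunk that stays under a per-worker memory budget. The sample measurements, the 0.8 safety factor, and the helper max_elements_for_budget are illustrative assumptions, not the paper's actual implementation or API.

    # Minimal sketch (not the authors' implementation): model peak memory as a
    # linear function of chunk size, then pick the largest chunk that fits a
    # per-worker memory budget. All numbers below are hypothetical.
    import numpy as np
    import dask.array as da

    # Hypothetical profiling samples for one operator:
    # (elements per chunk, observed peak memory in bytes) from small trial runs.
    elements = np.array([1e6, 2e6, 4e6, 8e6])
    peak_bytes = np.array([120e6, 230e6, 450e6, 890e6])

    # Ordinary least squares fit: peak_memory ~= a * elements + b
    a, b = np.polyfit(elements, peak_bytes, deg=1)

    def max_elements_for_budget(budget_bytes, safety=0.8):
        """Largest chunk (in elements) whose predicted peak stays under the budget."""
        return int(safety * (budget_bytes - b) / a)

    # Example: a worker with roughly 2 GiB available for this operator.
    budget = 2 * 1024**3
    n = max_elements_for_budget(budget)

    # Apply the derived chunk size to a Dask array (1-D here for simplicity).
    x = da.random.random(100_000_000, chunks=n)
    print(f"chosen chunk size: {n} elements, predicted peak ~= {(a * n + b) / 1e6:.0f} MB")

In this sketch the regression plays the role of the paper's memory model: once the per-operator slope and intercept are known, the chunk size follows directly from the worker's memory budget instead of from Dask's shape-based default.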

Published
28/10/2025
FONSECA, Daniel De Lucca; TRUJILLO, Carlos Alberto Astudillo; BORIN, Edson. Preventing Out-Of-Memory Errors in Dask through Automated Memory-Aware Chunking. In: SIMPÓSIO EM SISTEMAS COMPUTACIONAIS DE ALTO DESEMPENHO (SSCAD), 26., 2025, Bonito/MS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 97-108. DOI: https://doi.org/10.5753/sscad.2025.15848.