Can OpenMP Scale Beyond the Node? A Performance Evaluation of Remote Offloading via the MPI Proxy Plugin
Abstract
We evaluate the MPI Proxy Plugin (MPP), an extension to LLVM/OpenMP that enables remote offloading of target regions via MPI, allowing distributed GPU execution without modifying application code. Using benchmarks on NVIDIA H100 and AMD MI300A nodes, we show that MPP delivers competitive performance compared to traditional MPI+OpenMP, particularly for coarse-grained, compute-intensive workloads. While MPP simplifies development and enables communication–computation overlap, it introduces higher runtime overheads and stricter task-granularity requirements. Our results position MPP as a promising step toward unifying shared and distributed heterogeneous programming under the OpenMP model.
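Since MPP leaves application code unchanged, the regions it distributes are plain OpenMP target constructs. The sketch below is a hypothetical vector-add kernel (not taken from the paper) showing the kind of target region MPP can forward to another node; it assumes, as the abstract implies, that remote GPUs are exposed as ordinary OpenMP devices selectable with the standard device() clause.

/* Illustrative OpenMP target region of the kind MPP can offload remotely.
 * The kernel, sizes, and device selection are assumptions for illustration;
 * MPP requires no source changes and redirects the offload over MPI at the
 * runtime level. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int n = 1 << 20;
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* With remote offloading enabled, GPUs on other nodes would appear as
     * additional OpenMP devices; fall back to the host if none exist. */
    int dev = omp_get_num_devices() > 0 ? 0 : omp_get_initial_device();

    #pragma omp target teams distribute parallel for \
            map(tofrom: a[0:n]) map(to: b[0:n]) device(dev)
    for (int i = 0; i < n; i++)
        a[i] += b[i];

    printf("a[0] = %f\n", a[0]);
    free(a);
    free(b);
    return 0;
}

In this reading of the abstract, the same binary could be launched so that some device numbers resolve to GPUs on remote nodes, with all MPI traffic handled by the runtime rather than the application.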
Published
28/10/2025
How to Cite
CLÉTO, Jhonatan; VALARINI, Guilherme; YVIQUEL, Hervé. Can OpenMP Scale Beyond the Node? A Performance Evaluation of Remote Offloading via the MPI Proxy Plugin. In: SIMPÓSIO EM SISTEMAS COMPUTACIONAIS DE ALTO DESEMPENHO (SSCAD), 26., 2025, Bonito/MS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 278-289. DOI: https://doi.org/10.5753/sscad.2025.16707.
