Programming Interface for Fault-Tolerant Linear Pipelines with Standard C++ MPI
Abstract
Stream processing systems are designed to operate continuously and must be able to recover from failures. However, programming high-performance applications for distributed environments introduces significant development complexity. This work presents a programming interface that simplifies the construction of fault-tolerant linear pipelines for stream processing applications in C++. The solution uses MPI (Message Passing Interface) for communication and the ABS (Asynchronous Barrier Snapshotting) protocol, together with a monitor agent, for the recovery stage. Experimental results indicate a significant reduction in the programmer's estimated development effort, with an average impact on application throughput ranging from -0.98% to 6.73%. In addition, the recovery process mitigates the impact of failures on program throughput.
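To make the approach concrete, the following is a minimal sketch, not the paper's actual interface, of how ABS-style markers can drive checkpoints along a linear MPI pipeline: rank 0 acts as the source, the last rank as the sink, and each stage snapshots its local state when a marker arrives before forwarding it downstream. The tags, the snapshot_state helper, and the checkpoint file naming are illustrative assumptions.

```cpp
// Hypothetical illustration of ABS-style checkpointing on a linear MPI pipeline.
// Run with at least 2 ranks (e.g., mpirun -np 4 ./pipeline).
#include <mpi.h>
#include <cstdio>

enum Tag { TAG_DATA = 1, TAG_MARKER = 2, TAG_EOS = 3 };

// Application-level checkpoint: persist the operator state for this epoch.
static void snapshot_state(int rank, long state, int epoch) {
    char name[64];
    std::snprintf(name, sizeof(name), "ckpt_rank%d_epoch%d.bin", rank, epoch);
    if (FILE* f = std::fopen(name, "wb")) {
        std::fwrite(&state, sizeof(state), 1, f);
        std::fclose(f);
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int next = rank + 1, prev = rank - 1;
    long state = 0;   // running operator state (here: a simple sum)
    int epoch = 0;

    if (rank == 0) {                       // source: emit data and periodic markers
        for (long i = 1; i <= 100; ++i) {
            MPI_Send(&i, 1, MPI_LONG, next, TAG_DATA, MPI_COMM_WORLD);
            if (i % 25 == 0)               // inject a snapshot marker every 25 items
                MPI_Send(&i, 1, MPI_LONG, next, TAG_MARKER, MPI_COMM_WORLD);
        }
        long dummy = 0;
        MPI_Send(&dummy, 1, MPI_LONG, next, TAG_EOS, MPI_COMM_WORLD);
    } else {                               // stages and sink: single upstream channel
        while (true) {
            long item;
            MPI_Status st;
            MPI_Recv(&item, 1, MPI_LONG, prev, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DATA) {
                state += item;             // the stage's computation
            } else if (st.MPI_TAG == TAG_MARKER) {
                snapshot_state(rank, state, epoch++);  // checkpoint on marker arrival
            }
            if (rank != size - 1)          // forward data, markers, and EOS downstream
                MPI_Send(&item, 1, MPI_LONG, next, st.MPI_TAG, MPI_COMM_WORLD);
            if (st.MPI_TAG == TAG_EOS) break;
        }
        if (rank == size - 1) std::printf("sink state = %ld\n", state);
    }
    MPI_Finalize();
    return 0;
}
```

Because the pipeline is linear, each stage has a single upstream channel, so no marker alignment across inputs is needed; the marker simply flows through and partitions the stream into consistent epochs. Failure detection and restart from the latest completed epoch, handled in the paper by a monitor agent, are omitted from this sketch.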
Published
28/10/2025
How to Cite
MARTINS, Eduardo M.; HOFFMANN, Renato B.; ALF, Lucas M.; GRIEBLER, Dalvan. Interface para Programação de Pipelines Lineares Tolerantes a Falha para MPI Padrão C++. In: SIMPÓSIO EM SISTEMAS COMPUTACIONAIS DE ALTO DESEMPENHO (SSCAD), 26., 2025, Bonito/MS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 133-144. DOI: https://doi.org/10.5753/sscad.2025.15867.
