Implementing Fault Tolerance in the Lattice Boltzmann Method for Resilient Execution on Ephemeral AWS Instances

doi:10.5753/sscad.2025.15722

Rafael Luis Sol Veit Vargas (UFSC)
Vanderlei Munhoz (UFSC / University of Bordeaux / Inria)
Márcio Castro (UFSC)

Abstract

This paper investigates the performance and monetary cost of using fault tolerance mechanisms in the Lattice Boltzmann method (LBM) running on ephemeral (spot) instances of Amazon Web Services (AWS). Two recovery strategies are implemented with the ULFM extension of the MPI library: (i) preemptive, which suspends the application until a new instance is allocated and relies on disk persistence; and (ii) non-preemptive, which lets execution continue with a reduced number of instances and relies on in-memory persistence. The results indicate that the non-preemptive approach provides near-immediate recovery, at the cost of degraded post-failure performance. The preemptive approach avoids this degradation but has a longer recovery time. We conclude that the non-preemptive strategy with in-memory persistence can reduce monetary costs by up to 32%, even when failures occur.
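
The difference between the two recovery strategies can be illustrated with a minimal sketch. It is not the authors' implementation: plain Python lists stand in for MPI ranks, no ULFM calls are made, and the row-wise decomposition and the function names (`partition`, `recover_non_preemptive`, `recover_preemptive`) are assumptions chosen only for illustration.

```python
# Illustrative sketch only: Python lists stand in for MPI ranks.
# The row-wise split and all function names are hypothetical,
# not taken from the paper.

def partition(n_rows, ranks):
    """Split n_rows lattice rows into contiguous blocks, one per rank."""
    base, extra = divmod(n_rows, len(ranks))
    blocks, start = {}, 0
    for i, rank in enumerate(ranks):
        size = base + (1 if i < extra else 0)  # spread the remainder
        blocks[rank] = (start, start + size)   # half-open row range
        start += size
    return blocks


def recover_non_preemptive(n_rows, ranks, failed):
    """Continue with the surviving ranks: repartition the lattice among them."""
    survivors = [r for r in ranks if r != failed]
    if not survivors:
        raise RuntimeError("no surviving ranks")
    return survivors, partition(n_rows, survivors)


def recover_preemptive(n_rows, ranks, failed, replacement):
    """Resume with the original layout once a replacement rank is available."""
    restored = [replacement if r == failed else r for r in ranks]
    return restored, partition(n_rows, restored)
```

With 12 rows on ranks `[0, 1, 2, 3]`, each rank holds 3 rows. If rank 2 fails, `recover_non_preemptive` leaves three ranks with 4 rows each, which is one way to picture the post-failure slowdown, whereas `recover_preemptive` keeps 3 rows per rank but can only proceed after a replacement has been allocated.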
