Uma Implementação MPI Tolerante a Falhas do Algoritmo Paralelo de Ordenação Quickmerge

  • Felipe Xavier Universidade Tecnológica Federal do Paraná
  • Edson Tavares Camargo Universidade Tecnológica Federal do Paraná
  • Elias Duarte Jr

Abstract


Quickmerge is a parallel sorting algorithm that combines the efficient Quicksort strategy with merge operations of subsets which are created from key elements called pivots. Two versions of the Quickmerge algorithm that run on a hypercube have been found in the literature, but none can deal with process failures at runtime. In this work we present a fault-tolerant MPI implementation of both Quickmerge and Modified Quickmerge based on a virtual topology called VCube. The proposed algorithms work correctly even if all processes fail but one remains correct. The algorithms are compared to a fault-tolerant implementation of the Hyperquicksort parallel sorting algorithm also for the VCube. Results show the efficiency of the implementation to sort up to 1 billion integers in faulty and fault-free scenarios.

References

Ashraf, R. A., Hukerikar, S., and Engelmann, C. (2018). Shrink or substitute: Handling process failures in HPC systems using in-situ recovery. In Euromicro Int. Conference on Parallel, Distributed and Network-based Processing (PDP), pages 178–185.

Bland, W., Bouteiller, A., Hérault, T., Bosilca, G., and Dongarra, J. (2013). Post-failure recovery of MPI communication capability: Design and rationale. International Journal of HPC Applications, 27(3):244–254.

Camargo, E. T. and Duarte, Jr., E. (2016). Uma implementação MPI tolerante a falhas do algoritmo hyperquicksort. In Simp. Brasileiro de Redes de Computadores e Sistemas Distribuı́dos (SBRC) 2016 - Workshop de Tolerância a Falhas.

Camargo, E. T. d. and Duarte, Jr., E. P. (2017). Minicurso v: Técnicas para a construção de sistemas MPI tolerantes a falhas. In Minicurso do Simpósio de Sistemas Computacionais de Alto Desempenho (WSCAD) 2017.

Camargo, E. T. d. and Duarte, Jr., E. P. (2018). Running resilient MPI applications on a dynamic group of recommended processes. J. of the Brazilian Comp. Society, 24(1):5.

Duarte, E. P., Bona, L. C. E., and Ruoso, V. K. (2014). Vcube: A provably scalable distributed diagnosis algorithm. In 2014 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, pages 17–22.

Evans, D. (1990). A parallel sorting - merging algorithm for tightly coupled multiprocessors. Parallel Computing, 14(1):111 – 121.

Forum, M. (2015). Document for a standard message-passing interface 3.1. Technical report, University of Tennessee, http://www.mpi-forum.org/docs/mpi-3.1.

Forum, M. (2019). MPI forum website. http://mpi-forum.org/. Em 23/07/2019.

Freiling, F. C., Guerraoui, R., and Kuznetsov, P. (2011). The failure detector abstraction. ACM Comput. Surv., 43(2):9:1–9:40.

Gamell, M., Teranishi, K., Mayo, J., Kolla, H., Heroux, M. A., Chen, J., and Parashar, M. (2017). Modeling and simulating multiple failure masking enabled by local recovery for stencil-based applications at extreme scales. IEEE Transactions on Parallel and Distributed Systems, 28(10):2881–2895.

MPI-Forum (2019). User-level failure mitigation. https://bitbucket.org/icldistcomp/ulfm/. Acessado em 26/02/2019.

mpich.org (2017). High-performance portable mpi. http://www.mpich.org/. Acessado em 23/07/2019.

Parhami, B. (1999). Introduction to Parallel Processing: Algorithms and Architectures. Kluwer Academic Publishers, Norwell, MA, USA.

Quinn, M. J. (1988). Parallel sorting algorithms for tightly coupled multiprocessors. Parallel Computing, 6(3):349 – 357.

Quinn, M. J. (1989). Analysis and benchmarking of two parallel sorting algorithms: Hyperquicksort and quickmerge. BIT Numerical Mathematics, 29(2):239–250.

Shahzad, F., Thies, J., Kreutzer, M., Zeiser, T., Hager, G., and Wellein, G. (2019). Craft: A library for easier application-level checkpoint/restart and automatic fault tolerance. IEEE Transactions on Parallel and Distributed Systems, 30(3):501–514.

Shi, H. and Schaeffer, J. (1992). Parallel sorting by regular sampling. Journal of Parallel and Distributed Computing, 14(4):361 – 372.

Träff, J. L. (2018) Parallel quicksort without pairwise element exchange. CoRR, abs/1804.07494.

Wagar, B. (1987). Hyperquicksort: A fast sorting algorithm for hypercubes. Hypercube Multiprocessors, 1987:292–299.
Published
2019-11-08
XAVIER, Felipe; CAMARGO, Edson Tavares; DUARTE JR, Elias. Uma Implementação MPI Tolerante a Falhas do Algoritmo Paralelo de Ordenação Quickmerge. In: SYMPOSIUM ON HIGH PERFORMANCE COMPUTING SYSTEMS (SSCAD), 20. , 2019, Campo Grande. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2019 . p. 276-287. DOI: https://doi.org/10.5753/wscad.2019.8675.