Performance Evaluation of Dense Linear Algebra Kernels using Chameleon and StarPU on AWS

Vinicius Garcia Pinto; João V. F. Lima; Vanderlei Munhoz; Daniel Cordeiro; Emilio Francesquini; Márcio Castro

doi:10.5753/sscad.2024.244405

Vinicius Garcia Pinto (FURG)
João V. F. Lima (UFSM)
Vanderlei Munhoz (UFSC)
Daniel Cordeiro (USP)
Emilio Francesquini (UFABC)
Márcio Castro (UFSC)

Abstract

Thanks to recent advances and investments in cloud computing, public cloud providers now offer GPU-accelerated and compute-optimized Virtual Machine (VM) instances, allowing researchers to run parallel workloads on virtual heterogeneous clusters in the cloud. This paper evaluates the performance and monetary cost of running dense linear algebra algorithms from the Chameleon package, implemented using StarPU, on Amazon Elastic Compute Cloud (EC2) instances. We measured both metrics in two scenarios: a single powerful, costly instance with four NVIDIA GPUs (fat node), and a cluster of five less powerful, cheaper instances, each with a single NVIDIA GPU. Our results show that most of the linear algebra algorithms achieved better performance and lower monetary cost in the fat-node scenario, even though it has one fewer GPU.
