HPC on a budget: on the CPU's impact on dense linear algebra computation with GPUs

  • Lucas Barros de Assis UFRGS
  • Lucas Mello Schnorr UFRGS

Abstract


While GPU-equipped nodes dominate compute-intensive operations in high-performance computing, the role of host processors in overall system performance remains underexplored, particularly for hardware procurement decisions in resource-constrained environments. This study investigates the influence of CPU microarchitecture on application performance using an identical modern GPU. We compare two systems with the same NVIDIA RTX 4090 GPU but different CPUs: a 2016 Intel Xeon E5-2620v4 and a 2023 Intel Core i9-14900KF. We assess CPU impact on compute-bound workloads using dense Cholesky and LU factorizations from the Chameleon library assisted by the StarPU runtime system. Our findings demonstrate that: (1) CPU influence is negligible with modern GPU accelerators, even when some operations lack GPU implementations; (2) CPU-handled operations are sufficiently small to attenuate performance differences between processor generations; and (3) modern GPUs perform effectively on legacy hardware with minimal penalties. These results suggest that selective GPU upgrades offer cost-effective performance improvements without complete system overhauls, providing valuable insights for procurement strategies at academic and research institutions.

References

Agullo, E., Augonnet, C., Dongarra, J., Ltaief, H., Namyst, R., Thibault, S., and Tomov, S. (2010). Faster, Cheaper, Better – a Hybridization Methodology to Develop Linear Algebra Software for GPUs. In GPU Computing Gems, volume 2. Morgan Kaufmann.

Augonnet, C., Thibault, S., Namyst, R., and Wacrenier, P.-A. (2009). StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. In European Conference on Parallel Processing, pages 863–874. Springer.

Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Hérault, T., and Dongarra, J. J. (2013). PaRSEC: Exploiting heterogeneity to enhance scalability. Computing in Science & Engineering, 15(6):36–45.

Cambier, L., Qian, Y., and Darve, E. (2020). TaskTorrent: a lightweight distributed task-based runtime system in C++. In 2020 IEEE/ACM 3rd Annual Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM), pages 16–26. IEEE.

Cardosi, P. and Bramas, B. (2025). Specx: a C++ task-based runtime system for heterogeneous distributed architectures. PeerJ Computer Science, 11:e2966.

Dongarra, J. and Keyes, D. (2024). The co-evolution of computational physics and high-performance computing. Nature Reviews Physics, 6(10):621–627.

Gamblin, T., LeGendre, M., Collette, M. R., Lee, G. L., Moody, A., de Supinski, B. R., and Futral, S. (2015). The Spack package manager: bringing order to HPC software chaos. In Intl. Conf. for High Perf. Comp., Networking, Storage and Analysis.

Gonthier, M., Marchal, L., and Thibault, S. (2022). Memory-aware scheduling of tasks sharing data on multiple GPUs with dynamic runtime systems. In 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 694–704.

Klinkenberg, J., Samfass, P., Bader, M., Terboven, C., and Müller, M. S. (2020). Chameleon: reactive load balancing for hybrid MPI+OpenMP task-parallel applications. Journal of Parallel and Distributed Computing, 138:55–64.

Li, A., Song, S. L., Chen, J., Li, J., Liu, X., Tallent, N. R., and Barker, K. J. (2019). Evaluating modern GPU interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect. IEEE Transactions on Parallel and Distributed Systems, 31(1):94–110.

Mishra, S., Chakravorty, D. K., Perez, L. M., Dang, F., Liu, H., and Witherden, F. D. (2024). Impact of memory bandwidth on the performance of accelerators. In Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, PEARC ’24, New York, NY, USA. ACM.

Navarro, A., Vilches, A., Corbera, F., and Asenjo, R. (2014). Strategies for maximizing utilization on multi-CPU and multi-GPU heterogeneous architectures. The Journal of Supercomputing, 70(2):756–771.

Pei, Y., Bosilca, G., and Dongarra, J. (2022). Sequential task flow runtime model improvements and limitations. In 2022 IEEE/ACM International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), pages 1–8. IEEE.

StarPU Project (2015). StarPU Handbook (version 1.4.8). Université de Bordeaux, CNRS, INRIA. Generated by Doxygen; GNU Free Documentation License.

Talib, M. A., Majzoub, S., Nasir, Q., and Jamal, D. (2021). A systematic literature review on hardware implementation of artificial intelligence algorithms. The Journal of Supercomputing, 77(2):1897–1938.

Tan, G., Shui, C., Wang, Y., Yu, X., and Yan, Y. (2021). Optimizing the LINPACK algorithm for large-scale PCIe-based CPU-GPU heterogeneous systems. IEEE Transactions on Parallel and Distributed Systems, 32(9):2367–2380.

Tomov, S., Nath, R., Ltaief, H., and Dongarra, J. (2010). Dense linear algebra solvers for multicore with GPU accelerators. In 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), pages 1–8. IEEE.

Valero-Lara, P., Kim, J., Hernandez, O., and Vetter, J. (2021). OpenMP target task: Tasking and target offloading on heterogeneous systems. In European Conference on Parallel Processing, pages 445–455. Springer.

Xiao, W., Han, Z., Zhao, H., Peng, X., Zhang, Q., Yang, F., and Zhou, L. (2018). Scheduling CPU for GPU-based deep learning jobs. In Proceedings of the ACM Symposium on Cloud Computing, SoCC ’18, page 503, New York, NY, USA. ACM.
Published
28/10/2025
ASSIS, Lucas Barros de; SCHNORR, Lucas Mello. HPC on a budget: on the CPU's impact on dense linear algebra computation with GPUs. In: SIMPÓSIO EM SISTEMAS COMPUTACIONAIS DE ALTO DESEMPENHO (SSCAD), 26., 2025, Bonito/MS. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 398-409. DOI: https://doi.org/10.5753/sscad.2025.16739.