Optimizing Sustainable AI Factories via GPU Sharing
Abstract
The large-scale use of Artificial Intelligence (AI) has grown rapidly, increasing the pressure on high-performance computing (HPC) infrastructures. In particular, with the emergence of AI Factories, efficient resource-allocation strategies are needed to maximize throughput and reduce energy consumption. In this scenario, GPU sharing emerges as a promising alternative for improving accelerator utilization and balancing performance and energy efficiency. This paper investigates the impact of GPU sharing on energy efficiency during AI model inference using the Intel Data Center GPU Max 1550 accelerator on the exascale-class supercomputer Aurora. We evaluate four distinct AI models from different application domains, including computer vision, natural language processing, and text generation. The results show that co-locating applications with complementary resource-usage profiles can reduce execution time by up to 50% and energy consumption by up to 43%. Finally, we demonstrate that scheduling concurrent tasks on the GPU, guided by prior application characterization, is a promising strategy for maximizing efficiency in HPC environments.
References
Adufu, T., Ha, J., and Kim, Y. (2024). Exploring the Diversity of Multiple Job Deployments over GPUs for Efficient Resource Sharing. In ICOIN, pages 777–782. IEEE.
Chitty-Venkata, K. T., Raskar, S., Kale, B., Ferdaus, F., Tanikanti, A., Raffenetti, K., Taylor, V., Emani, M., and Vishwanath, V. (2024). LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators. In SC24-W, pages 1362–1379.
Costa, M. M., Navaux, P. O. A., Rizzi, S., and Lorenzon, A. F. (2025a). Optimizing QMCPACK Energy-Efficiency on Aurora via GPU Sharing. In 12th Latin American High Performance Computing Conference, CARLA 2025.
Costa, M. M., Rizzi, S., Navaux, P. O. A., and Lorenzon, A. F. (2025b). One GPU, Many Ranks: Enabling Performance and Energy-Efficient In-Transit Visualization via Resource Sharing. In Proceedings of the 54th International Conference on Parallel Processing, New York, NY, USA. Association for Computing Machinery.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Burstein, J., Doran, C., and Solorio, T., editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Eastep, J., Sylvester, S., Cantalupo, C., Geltz, B., Ardanaz, F., Al-Rawi, A., Livingston, K., Keceli, F., Maiterth, M., and Jana, S. (2017). Global Extensible Open Power Manager: A Vehicle for HPC Community Collaboration on Co-Designed Energy Management Solutions. In Kunkel, J. M., Yokota, R., Balaji, P., and Keyes, D., editors, High Performance Computing, pages 394–412, Cham. Springer International Publishing.
Ferdaus, F., Wu, X., Taylor, V., Lan, Z., Shanmugavelu, S., Vishwanath, V., and Papka, M. E. (2025). Evaluating Energy Efficiency of AI Accelerators Using Two MLPerf Benchmarks. In 2025 IEEE 25th CCGrid, pages 549–558.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Residual Learning for Image Recognition. In CVPR, pages 770–778.
Hestness, J., Keckler, S. W., and Wood, D. A. (2015). GPU Computing Pipeline Inefficiencies and Optimization Opportunities in Heterogeneous CPU-GPU Processors. In IISWC, pages 87–97. Institute of Electrical and Electronics Engineers Inc.
Intel Corporation (2025). Advanced Topics — Intel® oneAPI GPU Optimization Guide. [link].
John, C. M., Nassyr, S., Penke, C., and Herten, A. (2024). Performance and Power: Systematic Evaluation of AI Workloads on Accelerators with CARAML. In SC24-W, pages 1164–1176.
Kamath, A. K., Prabhu, R., Mohan, J., Peter, S., Ramjee, R., and Panwar, A. (2025). POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference. In ASPLOS, ASPLOS ’25, pages 897–912, New York, NY, USA. ACM.
Kreuzberger, D., Kühl, N., and Hirschl, S. (2023). Machine Learning Operations (MLOps): Overview, Definition, and Architecture. IEEE Access, 11:31866–31879.
Lee, M., Seong, S., Kang, M., Lee, J., Na, G.-J., Chun, I.-G., Nikolopoulos, D., and Hong, C.-H. (2024). ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments. In SC24, pages 1–14, Los Alamitos, CA, USA. IEEE Computer Society.
Navaux, P. O. A., Lorenzon, A. F., and Serpa, M. D. S. (2023). Challenges in High-Performance Computing. Journal of the Brazilian Computer Society, 29(1):51–62.
NVIDIA (2020). NVIDIA A100 Tensor Core GPU Architecture. [link].
NVIDIA (2023). NVIDIA Multi-Process Service Overview. [link].
Otterness, N. and Anderson, J. H. (2021). Exploring AMD GPU Scheduling Details by Experimenting With “Worst Practices”. In Proceedings of the 29th International Conference on Real-Time Networks and Systems, RTNS ’21, pages 24–34, New York, NY, USA. ACM.
Pratheek, B., Jawalkar, N., and Basu, A. (2021). Improving GPU Multi-tenancy with Page Walk Stealing. In HPCA, pages 626–639.
Reddi, V. J., Cheng, C., Kanter, D., Mattson, P., Schmuelling, G., Wu, C.-J., Anderson, B., Breughe, M., Charlebois, M., Chou, W., Chukka, R., Coleman, C., Davis, S., Deng, P., Diamos, G., Duke, J., Fick, D., Gardner, J. S., Hubara, I., Idgunji, S., Jablin, T. B., Jiao, J., John, T. S., Kanwar, P., Lee, D., Liao, J., Lokhmotov, A., Massa, F., Meng, P., Micikevicius, P., Osborne, C., Pekhimenko, G., Rajan, A. T. R., Sequeira, D., Sirasao, A., Sun, F., Tang, H., Thomson, M., Wei, F., Wu, E., Xu, L., Yamada, K., Yu, B., Yuan, G., Zhong, A., Zhang, P., and Zhou, Y. (2020). MLPerf Inference Benchmark. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pages 446–459.
Sethia, A. and Mahlke, S. (2014). Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution. In IEEE/ACM Int. Symposium on Microarchitecture, pages 647–658.
Silvano, C., Ielmini, D., Ferrandi, F., Fiorin, L., Curzel, S., Benini, L., Conti, F., Garofalo, A., Zambelli, C., Calore, E., Schifano, S., Palesi, M., Ascia, G., Patti, D., Petra, N., De Caro, D., Lavagno, L., Urso, T., Cardellini, V., Cardarilli, G. C., Birke, R., and Perri, S. (2025). A Survey on Deep Learning Hardware Accelerators for Heterogeneous HPC Platforms. ACM Comput. Surv., 57(11).
Tramm, J., Romano, P., Shriwise, P., Lund, A., Doerfert, J., Steinbrecher, P., Siegel, A., and Ridley, G. (2024). Performance Portable Monte Carlo Particle Transport on Intel, NVIDIA, and AMD GPUs. EPJ Web Conf., 302:04010.
Zhang, B., Li, S., and Li, Z. (2024). MIGER: Integrating Multi-Instance GPU and Multi-Process Service for Deep Learning Clusters. In Proceedings of the 53rd International Conference on Parallel Processing, pages 504–513, New York, NY, USA. ACM.
Published
28/10/2025
How to Cite
COSTA, Matheus M.; RIGO, Sandro; OSTHOFF, Carla; RIZZI, Silvio; NAVAUX, Philippe O. A.; LORENZON, Arthur. Otimização de Fábricas de IA Sustentáveis por Compartilhamento de GPU. In: SIMPÓSIO EM SISTEMAS COMPUTACIONAIS DE ALTO DESEMPENHO (SSCAD), 26., 2025, Bonito/MS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 314-325. DOI: https://doi.org/10.5753/sscad.2025.16724.
