Job Runtime Prediction Models Applied to a High-Performance Production Environment

Miguel de Lima; Bernardo Gallo; Luciano Andrade; Felipe A. Portella; Paulo J. B. Estrela; Renzo Q. Malini; Alan L. Nunes; José Viterbo; Lúcia M. A. Drummond

doi:10.5753/sscad.2024.244537

Miguel de Lima UFF
Bernardo Gallo UFF
Luciano Andrade UFF
Felipe A. Portella PETROBRAS
Paulo J. B. Estrela PETROBRAS
Renzo Q. Malini PETROBRAS
Alan L. Nunes UFF
José Viterbo UFF
Lúcia M. A. Drummond UFF

Abstract

This paper evaluates how job execution times predicted by the machine learning models J48, Linear Regression, and Random Forest affect scheduling in high-performance computing systems. The predicted times were fed to the SJF (Shortest Job First) policy in a scheduling simulation based on a set of thousands of jobs from real high-performance applications that ran in a Petrobras production environment. Two scheduling performance metrics, throughput and mean waiting time, were examined in addition to the traditional theoretical metrics used to assess predictive models. We show that the practical effect of the predictions can diverge from the theoretical results of the predictors, which highlights the importance of empirical evaluation when optimizing job scheduling.
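The divergence between predictor accuracy and scheduling outcome can be illustrated with a toy sketch. This is not the simulation used in the paper: it assumes a single resource, non-preemptive SJF, and all jobs queued at time 0. The three jobs, both sets of predictions, the `horizon` parameter, and the definition of throughput as "jobs finished within the horizon" are hypothetical choices made only for illustration.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between real and predicted runtimes."""
    return sum(abs(actual[j] - predicted[j]) for j in actual) / len(actual)


def simulate_sjf(actual, predicted, horizon):
    """Non-preemptive SJF on one resource, all jobs queued at time 0.

    Jobs are ordered by *predicted* runtime but occupy the resource for
    their *actual* runtime. Returns (mean waiting time, number of jobs
    finished within `horizon`), the latter being a simple stand-in for
    throughput.
    """
    order = sorted(actual, key=lambda job: predicted[job])
    clock = 0.0
    total_wait = 0.0
    finished_in_horizon = 0
    for job in order:
        total_wait += clock          # the job waited until now to start
        clock += actual[job]         # it then runs for its real duration
        if clock <= horizon:
            finished_in_horizon += 1
    return total_wait / len(order), finished_in_horizon


# Hypothetical runtimes (arbitrary time units).
actual = {"a": 10, "b": 2, "c": 5}
# model_x is closer to the truth on average but misorders jobs b and c.
model_x = {"a": 9, "b": 6, "c": 5}
# model_y has larger errors but preserves the true ordering.
model_y = {"a": 20, "b": 6, "c": 10}

for name, model in (("model_x", model_x), ("model_y", model_y)):
    mae = mean_absolute_error(actual, model)
    mean_wait, done = simulate_sjf(actual, model, horizon=4)
    print(f"{name}: MAE={mae:.2f} mean_wait={mean_wait:.2f} finished={done}")
```

In this toy case `model_x` has the lower mean absolute error, yet `model_y` yields the lower mean waiting time, because SJF depends only on the relative ordering of the predictions and not on their absolute error.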
