Optimized Parallel Reduction for Regular and Irregular Segments on GPU

  • Michel B. Cordeiro UFPR
  • Wagner M. Nunan Zola UFPR

Abstract


Reduction is an operation that combines all the elements of a collection by applying a binary operation, such as sum, maximum, or minimum, to all the elements to obtain a single resulting value. This work aims to investigate implementation strategies for segmented reduction on GPUs. Existing techniques for segmented reduction often show consistent performance but are relatively inefficient in absolute terms. These techniques are frequently optimized for specific workloads and, as a result, may exhibit degraded performance with certain input data, especially when segments have varying sizes. The algorithm presented in this paper employs three different strategies to handle segments of varying sizes and has the capability to select the best strategy at runtime to optimize the processing of each segment based on its size. The proposed algorithm delivers consistent performance, achieving up to 49.47 times speedup reducing variable-sized segments and up to 12.62 times acceleration working with regular-sized segments, compared to the segmented reduction algorithm from top GPU libraries. Additionally, this paper also explores strategies to accelerate non-segmented reduction, resulting in up to 1.77 times improvement compared to other parallel GPU implementations.

References

Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J. W., Lee, S.-H., and Skadron, K. (2009). Rodinia: A benchmark suite for heterogeneous computing. In 2009 IEEE international symposium on workload characterization (IISWC), pages 44–54. IEEE.

Cordeiro, M. and Zola, W. (2023). KNN paralelo em GPU para grandes volumes de dados com agregação de consultas. In Anais do XXIV Simpósio em Sistemas Computacionais de Alto Desempenho, pages 253–264, Porto Alegre, RS, Brasil. SBC.

Ferraz, S., Dias, V., Teixeira, C. H., Parthasarathy, S., Teodoro, G., and Meira, W. (2024). DuMato: An efficient warp-centric subgraph enumeration system for GPU. Journal of Parallel and Distributed Computing, 191:104903.

Gupta, K., Stuart, J. A., and Owens, J. D. (2012). A study of persistent threads style GPU programming for GPGPU workloads. In 2012 Innovative Parallel Computing (InPar).

Harris, M. (2007). Optimizing parallel reduction in CUDA. Nvidia developer technology.

Johnson, J., Douze, M., and Jégou, H. (2019). Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7(3):535–547.

Jradi, W. A. R., do Nascimento, H. A. D., and Martins, W. S. (2018). A fast and generic gpu-based parallel reduction implementation. In 2018 Symposium on High Performance Computing Systems (WSCAD), pages 16–22. IEEE.

Larsen, R. W. and Henriksen, T. (2017). Strategies for regular segmented reductions on gpu. In Proceedings of the 6th ACM SIGPLAN International Workshop on Functional High-Performance Computing, pages 42–52.

Luitjens, J. (2014). Faster parallel reductions on kepler. [link].

Meyer, B., Pozo, A., and Nunan Zola, W. M. (2021). Warp-centric K-nearest neighbor graphs construction on GPU. In 50th International Conference on Parallel Processing Workshop, ICPP Workshops ’21. Association for Computing Machinery.

NVIDIA (2024). CUB. [link].

Thrust (2024). Thrust: The C++ Parallel Algorithms Library. [link].

Warashina, T., Aoyama, K., Sawada, H., and Hattori, T. (2014). Efficient k-nearest neighbor graph construction using mapreduce for large-scale data sets. IEICE TRANSACTIONS on Information and Systems, 97(12):3142–3154.

Zola, W. M. N. and De Bona, L. C. E. (2012). Parallel speculative encryption of multiple AES contexts on GPUs. In 2012 Innovative Parallel Computing (InPar). IEEE.
Published
2024-10-23
CORDEIRO, Michel B.; ZOLA, Wagner M. Nunan. Optimized Parallel Reduction for Regular and Irregular Segments on GPU. In: BRAZILIAN SYMPOSIUM ON HIGH PERFORMANCE COMPUTING SYSTEMS (SSCAD), 25. , 2024, São Carlos/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024 . p. 192-203. DOI: https://doi.org/10.5753/sscad.2024.244797.