Integration and performance analysis of parallel RISC-V architectures

  • Casio P. Krebs (UNICAMP)
  • Guido Araujo (UNICAMP)
  • Lucas Wanner (UNICAMP)

Abstract

Vector and matrix architectures accelerate workloads by exploiting data-level parallelism and reducing instruction overhead, but using them typically requires manual code changes. This work explores the Hwacha vector coprocessor and the Gemmini matrix accelerator, integrated alongside the BOOM superscalar RISC-V core, with the goal of automating their activation. The SMR source-rewriting tool was extended with libraries for data preparation, data movement, and Hwacha/Gemmini activation for the GEMV and GEMM kernels. A simulation environment built on Verilator was used to evaluate performance across seven PolyBench kernels. The results show that SMR can activate hardware accelerators without modifying the base code, making efficient acceleration easier to obtain.
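As a minimal illustration of the rewriting pattern described above, the C sketch below shows a naive GEMM loop nest of the kind found in PolyBench, and the form such code takes once a source-matching tool replaces the loop with a call into an accelerator library. The `smr_gemmini_gemm` routine, its signature, and the use of `float` elements are hypothetical placeholders standing in for the paper's data-preparation and activation libraries; they are not the actual SMR or Gemmini API.

```c
#include <stddef.h>

/* Hypothetical accelerator-library entry point standing in for the
 * Gemmini activation routine a rewriting tool would insert. The real
 * call, element type, and signature belong to the paper's libraries. */
void smr_gemmini_gemm(size_t n, size_t m, size_t k,
                      const float *A, const float *B, float *C);

/* Before rewriting: the plain GEMM loop nest (C = A * B, row-major)
 * that a source-matching tool such as SMR would recognize. */
void gemm_naive(size_t n, size_t m, size_t k,
                const float A[n][k], const float B[k][m], float C[n][m]) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++) {
            C[i][j] = 0.0f;
            for (size_t l = 0; l < k; l++)
                C[i][j] += A[i][l] * B[l][j];
        }
}

/* After rewriting: the loop nest is replaced by a single library call,
 * leaving the surrounding base code untouched. */
void gemm_rewritten(size_t n, size_t m, size_t k,
                    const float A[n][k], const float B[k][m], float C[n][m]) {
    smr_gemmini_gemm(n, m, k, &A[0][0], &B[0][0], &C[0][0]);
}
```

The point of the pattern is that only the matched loop nest changes: callers of `gemm_naive` and the rest of the program are untouched, which is what allows the accelerators to be activated without modifying the base code.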

Published
28/10/2025
KREBS, Casio P.; ARAUJO, Guido; WANNER, Lucas. Integration and performance analysis of parallel RISC-V architectures. In: SIMPÓSIO EM SISTEMAS COMPUTACIONAIS DE ALTO DESEMPENHO (SSCAD), 26., 2025, Bonito/MS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 205-216. DOI: https://doi.org/10.5753/sscad.2025.16672.