Exploring Direct Convolution Performance on the Gemmini Accelerator
Abstract
Convolutional Neural Network (CNN) algorithms are becoming a recurrent solution to Computer Vision problems. These networks employ convolutions as their main building block, and since convolution is a costly operation it largely determines their performance. Given its importance in CNN algorithms, this work evaluates convolution performance on the Gemmini accelerator and compares it, in terms of execution time and energy consumption, to a conventional desktop CPU under both light and heavy load. We show that Gemmini achieves lower execution time and energy consumption than the CPU even for small convolutions, and that this performance gap grows with convolution size. Furthermore, we analyze the minimum frequency Gemmini requires to match the CPU's execution time, and show that Gemmini can reach the same runtime while operating at much lower frequencies.
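To make the two quantities in the abstract concrete, the sketch below shows a naive single-channel direct convolution in plain C, together with the break-even arithmetic behind the frequency analysis (the lowest accelerator clock that matches a measured CPU runtime is accelerator cycles divided by CPU time). All dimensions, names, and the numbers in main are illustrative assumptions, not the paper's benchmark configuration.

#include <stdio.h>

/* Minimal sketch, assuming a single input channel and a single filter.
 * IN_DIM, K_DIM, and direct_conv are names chosen for this example. */
#define IN_DIM  8                      /* input height/width (assumed)  */
#define K_DIM   3                      /* filter height/width (assumed) */
#define OUT_DIM (IN_DIM - K_DIM + 1)   /* valid (no-padding) output size */

static void direct_conv(const float in[IN_DIM][IN_DIM],
                        const float w[K_DIM][K_DIM],
                        float out[OUT_DIM][OUT_DIM])
{
    for (int r = 0; r < OUT_DIM; r++) {
        for (int c = 0; c < OUT_DIM; c++) {
            float acc = 0.0f;
            /* K_DIM * K_DIM multiply-accumulates per output element; this
             * inner product is the work a systolic array such as Gemmini's
             * performs in hardware instead of scalar CPU instructions. */
            for (int i = 0; i < K_DIM; i++)
                for (int j = 0; j < K_DIM; j++)
                    acc += in[r + i][c + j] * w[i][j];
            out[r][c] = acc;
        }
    }
}

/* Break-even clock: the lowest accelerator frequency that still matches a
 * measured CPU runtime, assuming the accelerator's cycle count is fixed:
 * f_min = accel_cycles / cpu_time. */
static double min_matching_freq_hz(long long accel_cycles, double cpu_time_s)
{
    return (double)accel_cycles / cpu_time_s;
}

int main(void)
{
    float in[IN_DIM][IN_DIM], w[K_DIM][K_DIM], out[OUT_DIM][OUT_DIM];

    /* Fill the input and filter with simple deterministic values. */
    for (int i = 0; i < IN_DIM; i++)
        for (int j = 0; j < IN_DIM; j++)
            in[i][j] = (float)(i + j);
    for (int i = 0; i < K_DIM; i++)
        for (int j = 0; j < K_DIM; j++)
            w[i][j] = 1.0f / (K_DIM * K_DIM);

    direct_conv(in, w, out);
    printf("out[0][0] = %f\n", out[0][0]);

    /* Hypothetical numbers: 1e6 accelerator cycles vs. a 2 ms CPU run. */
    printf("break-even frequency = %.0f Hz\n",
           min_matching_freq_hz(1000000LL, 0.002));
    return 0;
}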
Published
2020-10-21
How to Cite
VIEIRA, Caio; LORENZON, Arthur; SCHNORR, Lucas; NAVAUX, Philippe; BECK, Antonio Carlos. Exploring Direct Convolution Performance on the Gemmini Accelerator. In: BRAZILIAN SYMPOSIUM ON HIGH PERFORMANCE COMPUTING SYSTEMS (SSCAD), 21., 2020, Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 167-178. DOI: https://doi.org/10.5753/wscad.2020.14067.
