Optimizing Neural Network Training through TensorFlow Profile Analysis in a Shared Memory System

  • Fabrício Gomes Vilasbôas, Mackenzie Presbyterian University
  • Calebe Paula Bianchini, Mackenzie Presbyterian University
  • Rodrigo Pasti, Mackenzie Presbyterian University
  • Leandro Nunes Castro, Mackenzie Presbyterian University

Abstract

On the one hand, Deep Neural Networks have emerged as a powerful tool for solving complex problems in image and text analysis. On the other, they are sophisticated learning machines that require deep programming and math skills to be understood and implemented. Therefore, most researchers employ toolboxes and frameworks to design and implement such architectures. This paper performs an execution analysis of TensorFlow, one of the most widely used deep learning frameworks, on a shared memory system. To do so, we chose a text classification problem based on tweet sentiment analysis. The focus of this work is to identify the best environment configuration for training neural networks on a shared memory system. We set five different configurations using environment variables to modify TensorFlow's execution behavior. The results on an Intel Xeon Platinum 8000 series processor show that TensorFlow's default environment configuration yields a speedup of up to 5.8, but that fine-tuning this environment improves the speedup by at least 37%.
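The abstract does not reproduce the five configurations themselves, but the kind of environment tuning it refers to can be sketched. The following is a minimal illustration, assuming a 2019-era, MKL-enabled TensorFlow 1.x build on a multicore Xeon; the thread counts are placeholder values, not the settings measured in the paper:

    import os

    # Intel OpenMP runtime variables (documented in Intel's environment
    # variable reference and the TensorFlow performance guide). The
    # values below are illustrative only.
    os.environ["OMP_NUM_THREADS"] = "24"   # worker threads per parallel region
    os.environ["KMP_BLOCKTIME"] = "0"      # ms an idle thread spins before sleeping
    os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"  # pin threads to cores

    import tensorflow as tf  # import after the variables are set

    # TensorFlow 1.x thread pools: intra_op parallelizes a single
    # operation (e.g., one large matmul); inter_op runs independent
    # operations concurrently.
    config = tf.ConfigProto(intra_op_parallelism_threads=24,
                            inter_op_parallelism_threads=2)
    session = tf.Session(config=config)

Setting the OpenMP variables before importing TensorFlow matters, because the runtime reads them when the library is first loaded.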

Published
12/11/2019
VILASBÔAS, Fabrício Gomes; BIANCHINI, Calebe Paula; PASTI, Rodrigo; CASTRO, Leandro Nunes. Optimizing Neural Network Training through TensorFlow Profile Analysis in a Shared Memory System. In: WORKSHOP DE COMPUTAÇÃO HETEROGÊNEA - SIMPÓSIO EM SISTEMAS COMPUTACIONAIS DE ALTO DESEMPENHO (SSCAD), 20., 2019, Campo Grande. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2019. p. 73-83. DOI: https://doi.org/10.5753/wscad_estendido.2019.8701.