Review Articles

Research on three-step accelerated gradient algorithm in deep learning

Yongqiang Lian,

KLATASDS-MOE, School of Statistics, East China Normal University, Shanghai, People’s Republic of China

Yincai Tang,

KLATASDS-MOE, School of Statistics, East China Normal University, Shanghai, People’s Republic of China

Shirong Zhou

KLATASDS-MOE, School of Statistics, East China Normal University, Shanghai, People’s Republic of China

Pages 40–57 | Received 04 Jun. 2020, Accepted 02 Nov. 2020, Published online: 23 Nov. 2020
Abstract

The gradient descent (GD) algorithm is a widely used optimisation method for training machine learning and deep learning models. In this paper, based on GD, Polyak’s momentum (PM), and Nesterov accelerated gradient (NAG), we establish the convergence of these algorithms from an initial value to the optimal value of an objective function in simple quadratic form. Motivated by the convergence property of the quadratic function, the two sister sequences of NAG’s iteration, and parallel tangent methods in neural networks, we propose the three-step accelerated gradient (TAG) algorithm, which maintains three sequences rather than two sister sequences. To illustrate the performance of this algorithm, we compare it with the three other algorithms on a quadratic function, high-dimensional quadratic functions, and a nonquadratic function. We then combine the TAG algorithm with the backpropagation algorithm and the stochastic gradient descent algorithm in deep learning. To facilitate use of the proposed algorithms, we rewrite the R package ‘neuralnet’ and extend it to ‘supneuralnet’; all of the deep learning algorithms in this paper are included in the ‘supneuralnet’ package. Finally, we show that our algorithms are superior to the other algorithms in four case studies.
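
The TAG update itself is developed in the full article and is not reproduced here. As background for the comparison the abstract describes, the following minimal R sketch (not the authors’ code; the function names, step size, and momentum coefficient are illustrative assumptions) applies the three baseline updates, GD, PM, and NAG, to the one-dimensional quadratic f(x) = x^2/2, whose gradient is x.

grad_f <- function(x) x  # gradient of f(x) = x^2 / 2

run_method <- function(method, x0 = 10, lr = 0.1, beta = 0.9, iters = 200) {
  x <- x0
  v <- 0  # velocity term used by the momentum methods
  for (k in seq_len(iters)) {
    if (method == "GD") {
      x <- x - lr * grad_f(x)                     # plain gradient step
    } else if (method == "PM") {
      v <- beta * v - lr * grad_f(x)              # heavy-ball step (Polyak, 1964)
      x <- x + v
    } else if (method == "NAG") {
      v <- beta * v - lr * grad_f(x + beta * v)   # look-ahead step (Nesterov, 1983)
      x <- x + v
    }
  }
  x
}

sapply(c("GD", "PM", "NAG"), run_method)  # each iterate approaches the minimiser 0

With these illustrative settings all three methods converge to the minimiser x* = 0, and the momentum variants do so in fewer iterations; this is the behaviour the paper’s quadratic-function comparison quantifies before extending it to the three-sequence TAG update.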


References

  • Andrei, N. (2008). An unconstrained optimization test functions collection. Advanced Modeling and Optimization, 10(1), 147–161.
  • Anthony, M., & Bartlett, P. L. (2009). Neural network learning: Theoretical foundations. Cambridge University Press.
  • Armijo, L. (1966). Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16(1), 1–3.
  • Berger, J., & He, X. (2019). Statistics at a crossroads: Challenges and opportunities in the data science era.
  • Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
  • Blum, A. L., & Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. Neural Networks, 5(1), 117–127.
  • Borchers, H. W. (2019). pracma: Practical numerical math functions [Computer software manual]. (R package version 2.2.9). 
  • Chen, K. K. (2016). The upper bound on knots in neural networks. arXiv:1611.09448.
  • Churchland, P. S., & Sejnowski, T. J. (2016). The computational brain. The MIT Press.
  • Deng, Q., Cheng, Y., & Lan, G. (2018). Optimal adaptive and accelerated stochastic gradient descent. arXiv:1810.00553.
  • Flach, P. (2012). Machine learning: The art and science of algorithms that make sense of data. Cambridge University Press.
  • Fritsch, S., Guenther, F., Wright, M. N., Suling, M., & Mueller, S. M. (2019). Neuralnet: Training of neural networks [Computer software manual]. (R package version 1.44.2). 
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
  • Gordon, M. (2018). My attempt to understand the backpropagation algorithm for training neural networks (Tech. Rep.).
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer.
  • Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical learning with sparsity: The lasso and generalizations. CRC Press.
  • Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. John Wiley & Sons Inc.
  • Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527–1554.
  • Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.
  • Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 251–257.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2015). An introduction to statistical learning: With applications in R. Springer.
  • Karatzoglou, A., & Meyer, D. (2006). Support vector machines in R. Journal of Statistical Software, 15(9), 1–28.
  • Kawaguchi, K. (2016). Deep learning without poor local minima. arXiv:1605.07110.
  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90.
  • Lan, G. (2009). Convex optimization under inexact first-order information [Unpublished doctoral dissertation]. 
  • Lan, G. (2012). An optimal method for stochastic composite optimization. Mathematical Programming, 133(1), 365–397.
  • Le, Q. V. (2015). A tutorial on deep learning (Research Report No. 06.3). Google Brain, Google Inc.
  • LISA Lab. (2015). Deep learning tutorial (Release 0.1). University of Montreal.
  • Magoulas, G. D., Vrahatis, M. N., & Androulakis, G. S. (1997). Effective backpropagation training with variable stepsize. Neural Networks, 10(1), 69–82.
  • Matloff, N. (2017). Statistical regression and classification: From linear models to machine learning. CRC Press.
  • McCulloch, W. S., & Pitts, W. (1990). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biology, 52(1–2), 99–115.
  • Meyer, D., Dimitriadou, E., & Hornik, K. (2019). e1071: Misc Functions of the Department of Statistics, Probability Theory Group [Computer software manual]. (R package version 1.7-3).
  • Mitliagkas, I. (2018). Theoretical principles for deep learning (Tech. Rep.). University of Montreal. 
  • Moreira, M., & Fiesler, E. (1995). Neural networks with adaptive learning rate and momentum terms (Tech. Rep.). IDIAP. 
  • Murphy, K. P. (2012). Machine learning: A probabilistic perspective. MIT Press.
  • Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k²). Doklady Akademii Nauk SSSR, 27(2), 372–376.
  • Nesterov, Y. (2004). Introductory lectures on convex optimization: A basic course. Kluwer Academic Publishers.
  • Nielsen, M. A. (2019). Neural networks and deep learning.
  • Petalas, Y. G., & Vrahatis, M. N. (2004). Parallel tangent methods with variable stepsize. 2004 IEEE international joint conference on neural networks (IEEE Cat. No. 04ch37541) (Vol. 2, pp. 1063–1066). IEEE. 
  • Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17.
  • Prechelt, L. (1994). Proben1: A set of neural network benchmark problems and benchmarking rules (Tech. Rep.). University of Karlsruhe.
  • Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C. Cambridge University Press.
  • Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1), 145–151.
  • R Core Team (2020). R: A language and environment for statistical computing [Computer software manual]. 
  • Ripley, B., & Venables, W. (2020). nnet: Feed-forward neural networks and multinomial log-linear models [Computer software manual]. (R package version 7.3-13).
  • Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
  • Rumelhart, D. E., & McClelland, J. L. (1986). Parallel distributed processing. MIT Press.
  • Schonberger, V. M., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt.
  • Shah, B. V., Buehler, R. J., & Kempthorne, O. (1964). Some algorithms for minimizing a function of several variables. Journal of the Society for Industrial and Applied Mathematics, 12(1), 74–92.
  • Stanford University. (2019). CS231n: Convolutional neural networks for visual recognition [Course notes].
  • Wickham, H. (2014). Advanced R. Chapman & Hall/CRC.
  • Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer.
  • Yin, W. (2015). Optimization: Gradient methods (Tech. Rep.). Department of Mathematics, University of California. 
