2008 Special Issue
Second-order stagewise backpropagation for Hessian-matrix analyses and investigation of negative curvature☆
Introduction
In the early literature on multilayer perceptron (MLP) learning, troublesome situations were often described solely as a local-minimum problem. In the well-known 2-bit parity XOR problem, for instance, a set of nine weight values was specified for a 2-2-1 MLP as a local-minimum example (see pp. 332–333 of Rumelhart, Hinton, and Williams (1986, Chap. 8)). However, it was later reported that there is no local minimum in 2-2-1 XOR MLP-learning (Hamey, 1998; Sprinkhuizen-Kuyper and Boers, 1998, 1999); rather, the trouble was due to saddle points. Malicious learning situations may enigmatically lead the resulting set of weights to such saddle points; some cases might be related to the singularities summarized in a recent paper (Amari, Park, & Ozeki, 2006) (see also references therein).
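As a standard second-order criterion (stated here in generic notation, not necessarily the paper's), a stationary point of the error measure is a saddle point rather than a local minimum whenever the Hessian there has eigenvalues of both signs:
\[
\nabla E(\mathbf{w}^{*}) = \mathbf{0}, \qquad
\lambda_{\min}\bigl(\nabla^{2}E(\mathbf{w}^{*})\bigr) < 0 < \lambda_{\max}\bigl(\nabla^{2}E(\mathbf{w}^{*})\bigr)
\;\Longrightarrow\; \mathbf{w}^{*}\ \text{is a saddle point},
\]
which is how configurations once reported as "local minima" can be re-identified as saddle points.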
MLP-learning is an optimization problem that often involves negative curvature. It is implicitly constrained, since hidden-node sigmoidal functions may become saturated during learning. Unlike the previous work mentioned above, we actually evaluate the Hessian matrix of the sum-squared-error measure, and show that it is indefinite (even in some previously deemed local-minimum cases) as well as better-conditioned than the Gauss–Newton Hessian (in contrast to the result in Saarinen, Bramley, and Cybenko (1993)), especially in an early learning phase. We then exploit negative curvature for second-order optimization. Along this line of attack, we emphasize the utility of our stagewise second-order backpropagation for efficient Hessian evaluation. In what follows, we first outline the stagewise procedure, an extension of the discrete-time optimal-control stagewise Newton method of Dreyfus (1966), and then describe the analyses of the Hessian, with numerical evidence, that disclose several important findings on sparsity, outer-product structure, rank, indefiniteness, condition number, and eigenvalues.
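For reference, the familiar nonlinear least-squares decomposition underlying these claims can be written (in generic notation that may differ from the paper's) as
\[
E(\mathbf{w}) = \tfrac{1}{2}\,\mathbf{r}(\mathbf{w})^{\mathsf T}\mathbf{r}(\mathbf{w}),
\qquad
\nabla^{2}E(\mathbf{w}) =
\underbrace{J(\mathbf{w})^{\mathsf T}J(\mathbf{w})}_{\text{Gauss--Newton Hessian},\ \succeq 0}
\;+\;
\underbrace{\sum_{i} r_{i}(\mathbf{w})\,\nabla^{2}r_{i}(\mathbf{w})}_{\text{residual-weighted second derivatives}},
\]
so the full Hessian can become indefinite only through the second, residual-weighted term; this is exactly the large-residual, highly nonlinear regime examined below.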
Section snippets
Stagewise backpropagation
Multilayer perceptron (MLP) learning is a discrete-time (or multistage decision-making) optimal-control problem (Mizutani and Dreyfus, 2005, Mizutani et al., 2000). Each stage (or layer) involves a state vector of node outputs and a control vector of weights (including thresholds); hence, the total number of decision variables is the number of weights summed over all layers. In optimal control, the objective function generally consists of stage costs that depend on both the state and the control (the weights between consecutive layers) …
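A minimal sketch of such a stagewise formulation, using illustrative symbols rather than the paper's own notation, is
\[
\text{minimize}\;\; \sum_{s=1}^{N} \ell_{s}\bigl(\mathbf{x}^{(s-1)},\,\mathbf{u}^{(s)}\bigr)
\qquad \text{subject to} \qquad
\mathbf{x}^{(s)} = \mathbf{f}_{s}\bigl(\mathbf{x}^{(s-1)},\,\mathbf{u}^{(s)}\bigr), \quad s = 1,\dots,N,
\]
where \(\mathbf{x}^{(s)}\) denotes the node-output (state) vector of layer \(s\) and \(\mathbf{u}^{(s)}\) the weights (controls) feeding layer \(s\); in MLP-learning the dominant cost is typically the terminal-stage sum-squared output error.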
Analyses on the Hessian matrix
The efficient stagewise procedure allows us to analyze the Hessian matrix in the various aspects that follow.
On exploiting negative curvature
We show experimentally that the Hessian is indefinite (because the residual-weighted second-derivative term becomes dominant when the residuals are large and/or highly nonlinear; see Section 3.4), and that it can be much "less" ill-conditioned than the Gauss–Newton Hessian, unlike the report in Saarinen et al. (1993). These behaviors were observed in the relatively early learning phase. When the Hessian is indefinite, the Newton step always moves the current point to the nearest saddle point of the current local quadratic model, as illustrated in Figs. 3(a) and (b), where the eigen-direction …
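To make the saddle-point behavior of the Newton step concrete (a standard textbook argument, not an excerpt from the paper): the step stationarizes the local quadratic model
\[
m(\mathbf{p}) = E(\mathbf{w}) + \mathbf{g}^{\mathsf T}\mathbf{p} + \tfrac{1}{2}\,\mathbf{p}^{\mathsf T} H\,\mathbf{p},
\qquad
\mathbf{p}_{\mathrm N} = -H^{-1}\mathbf{g},
\]
and when \(H\) has an eigenpair \((\lambda, \mathbf{v})\) with \(\lambda < 0\), that stationary point is a saddle of \(m\); along \(\pm\mathbf{v}\) (sign chosen so that \(\mathbf{g}^{\mathsf T}\mathbf{v} \le 0\)) the model decreases without bound, which is precisely the negative-curvature direction a curvature-exploiting method follows instead of, or in addition to, the plain Newton step.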
Discussion and conclusion
When the residuals are large and/or highly nonlinear, the Hessian matrix is prone to be indefinite and much better-conditioned than the Gauss–Newton Hessian. In MLP-learning, a special sparsity structure inevitably arises in the Hessian, which is separable into a neat block-diagonal part and a sparse block consisting of only first derivatives. The sparsity patterns so posed greatly affect its rank and indefiniteness. Although the Hessian is expected to be positive definite at the solution, it may not be on the way there. The …
Acknowledgments
The authors would like to thank John Dennis (Rice University) and James Demmel (UC Berkeley) for useful discussions.
References (31)
- L. G. C. Hamey. XOR has no local minima: A case study in neural network error surface analysis. Neural Networks (1998).
- On structure-exploiting trust-region regularized nonlinear least squares algorithms for neural-network learning. Neural Networks (2003).
- S. Amari, H. Park, T. Ozeki. Singularities affect dynamics of learning in neuro-manifolds. Neural Computation (2006).
- C. M. Bishop. Neural networks for pattern recognition (1995).
- Back propagation fails to separate where perceptrons succeed. IEEE Transactions on Circuits and Systems (1989).
- A new algorithm for nonlinear least squares curve fitting.
- J. W. Demmel. Applied numerical linear algebra (1997).
- J. E. Dennis, R. B. Schnabel. Numerical methods for unconstrained optimization and nonlinear equations (1983).
- The numerical solution of non-linear optimal control problems.
- G. Golub, V. Pereyra. Separable nonlinear least squares: The variable projection method and its applications. Inverse Problems.
- Complete solution of the local minima in the XOR problem. Network: Computation in Neural Systems.
- T. Masters. Advanced algorithms for neural networks: A C++ sourcebook.
☆ An abbreviated version of some portions of this article appeared in Mizutani and Dreyfus (2007) as part of the IJCNN 2007 Conference Proceedings, published under IEEE copyright.