Neural Networks

Volume 21, Issues 2–3, March–April 2008, Pages 193–203

2008 Special Issue
Second-order stagewise backpropagation for Hessian-matrix analyses and investigation of negative curvature

https://doi.org/10.1016/j.neunet.2007.12.038

Abstract

Multi-stage feed-forward neural network (NN) learning with sigmoidal-shaped hidden-node functions is an implicitly constrained optimization problem featuring negative curvature. Our analyses of the Hessian matrix H of the sum-squared-error measure highlight the following intriguing findings: at an early stage of learning, H tends to be indefinite and much better-conditioned than the Gauss–Newton Hessian J^T J; the NN structure influences the indefiniteness and rank of H; and exploiting negative curvature leads to effective learning. All of these can be confirmed numerically owing to our stagewise second-order backpropagation; this systematic procedure exploits the NN’s “layered symmetry” to compute H efficiently, making exact Hessian evaluation feasible for fairly large practical problems.

Introduction

In the early literature on multilayer perceptron (MLP) learning, troublesome situations were often described solely as a local-minimum problem. In the well-known 2-bit parity XOR problem, for instance, a set of nine weight values was specified for a 2-2-1 MLP as a local-minimum example (see pp. 332–333 of Rumelhart, Hinton, and Williams (1986, Chap. 8)). However, it was later reported that there is no local minimum in 2-2-1 XOR MLP-learning (Hamey, 1998, Sprinkhuizen-Kuyper and Boers, 1998, Sprinkhuizen-Kuyper and Boers, 1999); rather, the trouble was due to a saddle point. Troublesome learning situations may enigmatically lead the resulting set of weights to such saddle points; some cases might be related to the singularities summarized in a recent paper (Amari, Park, & Ozeki, 2006) (see also the references therein).

MLP-learning is an optimization problem that often involves negative curvature. It is implicitly constrained, since the hidden-node sigmoidal functions may become saturated during learning. Unlike the previous work mentioned above, we actually evaluate the Hessian matrix H of the sum-squared-error measure, and show that H is indefinite (even in some previously-deemed local-minimum cases) as well as better-conditioned than the Gauss–Newton Hessian (in contrast to the result in Saarinen, Bramley, and Cybenko (1993)), especially at an early learning phase. We then exploit negative curvature for second-order optimization. Along this line of attack, we emphasize the utility of our stagewise second-order backpropagation as an efficient Hessian-evaluation procedure. In what follows, we first outline the stagewise procedure, an extension of the discrete-time optimal-control stagewise Newton method of Dreyfus (1966), and then describe our analyses of H, with numerical evidence that discloses several important findings on sparsity, outer-product structure, rank, indefiniteness, condition number, and eigenvalues.
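To convey the flavor of this comparison, the sketch below builds the residual Jacobian J and the exact Hessian H of the sum-squared-error for a small 2-2-1 XOR MLP at a randomly chosen early-learning point, and compares their eigenvalue spectra and the rank of J^T J. It is only a rough illustration in Python/NumPy: finite differences stand in for our stagewise second-order backpropagation, and the tanh hidden nodes, linear output node, and initialization scale are illustrative assumptions.

```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # 2-bit XOR inputs
T = np.array([0., 1., 1., 0.])                            # XOR targets

def unpack(w):
    # illustrative 2-2-1 MLP parameterization: 6 hidden weights (incl. thresholds)
    # followed by 3 output weights (incl. threshold); 9 weights in total
    W1 = w[:6].reshape(2, 3)       # rows: hidden nodes, cols: [threshold, x1, x2]
    W2 = w[6:9]                    # [threshold, h1, h2]
    return W1, W2

def residuals(w):
    W1, W2 = unpack(w)
    Xb = np.hstack([np.ones((4, 1)), X])      # prepend 1 for the thresholds
    Hid = np.tanh(Xb @ W1.T)                  # sigmoidal hidden-node outputs
    Hb = np.hstack([np.ones((4, 1)), Hid])
    return Hb @ W2 - T                        # residual vector r (linear output node)

def sse(w):
    r = residuals(w)
    return 0.5 * (r @ r)                      # sum-squared-error measure

def fd_jacobian(w, eps=1e-6):
    r0 = residuals(w)
    J = np.zeros((r0.size, w.size))
    for j in range(w.size):
        wp = w.copy(); wp[j] += eps
        J[:, j] = (residuals(wp) - r0) / eps
    return J

def fd_hessian(f, w, eps=1e-4):
    n = w.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4 * eps ** 2)
    return 0.5 * (H + H.T)                    # symmetrize away round-off asymmetry

rng = np.random.default_rng(0)
w = 0.5 * rng.standard_normal(9)              # a random "early-learning" point
H = fd_hessian(sse, w)                        # exact Hessian (up to FD error)
J = fd_jacobian(w)
GN = J.T @ J                                  # Gauss-Newton Hessian

print("eig(H)    :", np.round(np.sort(np.linalg.eigvalsh(H)), 4))
print("eig(J^T J):", np.round(np.sort(np.linalg.eigvalsh(GN)), 4))
print("rank(J^T J) =", np.linalg.matrix_rank(GN), "of", GN.shape[0])
```

With only four XOR residuals and nine weights, J^T J here has rank at most four, while the exact Hessian H at such an early point may already exhibit negative eigenvalues; the analyses below establish this kind of behavior with exact stagewise computations rather than finite differences.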

Section snippets

Stagewise backpropagation

Multilayer perceptron (MLP) learning is a discrete-time (or N-stage decision-making) optimal-control problem (Mizutani and Dreyfus, 2005, Mizutani et al., 2000). Each stage s (or layer s) involves a P_s-dimensional state vector y_s (of P_s node outputs) and an n_s-dimensional control vector θ_{s,s+1} (of weights including thresholds); hence, n_s = (1 + P_s) P_{s+1}. In optimal control, the objective function generally consists of stage costs L_s(y_s, θ_{s,s+1}) that depend on both the control θ_{s,s+1} (between layers s and s+1) …
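The bookkeeping above can be read off directly in code. The minimal sketch below (illustrative layer sizes, tanh nodes at every stage; not the paper’s procedure) checks the weight count n_s = (1 + P_s) P_{s+1} at each stage and runs the stagewise forward pass in which the control θ_{s,s+1} drives the state y_s to y_{s+1}.

```python
import numpy as np

layer_sizes = [4, 5, 3, 1]            # P_0, ..., P_N for a hypothetical 4-5-3-1 MLP
rng = np.random.default_rng(1)

# one "control" per stage: theta[s] maps [1; y_s] (length 1 + P_s) to the next state
theta = [rng.standard_normal((layer_sizes[s + 1], 1 + layer_sizes[s]))
         for s in range(len(layer_sizes) - 1)]

for s, th in enumerate(theta):
    n_s = th.size                     # number of weights (incl. thresholds) at stage s
    assert n_s == (1 + layer_sizes[s]) * layer_sizes[s + 1]
    print(f"stage {s}: n_s = (1 + {layer_sizes[s]}) * {layer_sizes[s + 1]} = {n_s}")

def forward(y0):
    """Stagewise forward pass: the control theta[s] drives the state y_s to y_{s+1}."""
    y = np.asarray(y0, dtype=float)
    for th in theta:
        y = np.tanh(th @ np.concatenate(([1.0], y)))   # sigmoidal nodes at every stage
    return y

print("terminal state y_N:", forward(rng.standard_normal(layer_sizes[0])))
```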

Analyses on the Hessian matrix H

The efficient stagewise procedure allows us to analyze H from the various aspects described below.

On exploiting negative curvature

We show experimentally that H is indefinite (because S becomes dominant in H when the residuals are large and/or highly nonlinear; see Section 3.4), and that H can be much less ill-conditioned than J^T J, unlike the report in Saarinen et al. (1993). These behaviors were observed in the relatively early learning phase. When H is indefinite, the Newton step always moves the current point to the nearest saddle point of the current local quadratic model, as illustrated in Figs. 3(a) and (b), where the eigen-direction …
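A generic way to act on this observation is sketched below (it is the standard negative-curvature idea, not the specific algorithm evaluated in this paper): whenever H is indefinite, step along an eigenvector of H associated with its most negative eigenvalue, signed so that it is a non-ascent direction, instead of taking the Newton step.

```python
import numpy as np

def curvature_aware_step(H, g, tol=1e-8):
    """Return a search direction from gradient g and (possibly indefinite) Hessian H."""
    eigvals, eigvecs = np.linalg.eigh(H)       # eigenvalues in ascending order
    if eigvals[0] < -tol:                      # H is indefinite: use negative curvature
        d = eigvecs[:, 0]                      # eigen-direction of most negative curvature
        if g @ d > 0:                          # orient the direction downhill
            d = -d
        return d
    # otherwise fall back on a (slightly damped) Newton step
    return -np.linalg.solve(H + tol * np.eye(len(g)), g)

# toy quadratic model with one negative-curvature direction
H = np.array([[2.0, 0.0],
              [0.0, -1.0]])
g = np.array([1.0, 0.0])
print("direction:", curvature_aware_step(H, g))   # follows the negative-curvature axis
```

Orienting the eigenvector against the gradient guarantees a non-ascent direction; even when the gradient is orthogonal to it, the negative curvature alone ensures that the local quadratic model decreases along the step.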

Discussion and conclusion

When the residuals are large and/or highly nonlinear, the Hessian matrix H (= J^T J + S) is prone to be indefinite and much better-conditioned than J^T J. In MLP-learning, a special sparsity structure inevitably arises in S, which is separable into V_s, a neat block-diagonal form, and Γ_{s,t}, a sparse block of only first derivatives. The sparsity patterns in S greatly affect its rank and indefiniteness. Although H is expected to be positive definite at the solution, on the way there it may not be. The …
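The decomposition H = J^T J + S, with S = Σ_i r_i ∇²r_i for residuals r_i, can be verified numerically on any smooth least-squares model. The sketch below uses a hypothetical two-residual, two-parameter toy function (not the MLP error), with finite differences standing in for the exact stagewise computation; at the chosen point the second-derivative term S happens to dominate, so H comes out indefinite while J^T J stays positive semidefinite.

```python
import numpy as np

def r(w):
    # hypothetical two-residual toy model (not the MLP error)
    return np.array([np.tanh(w[0] * w[1]) - 0.9,
                     np.sin(w[0]) + w[1] ** 2 - 1.2])

def E(w):
    return 0.5 * np.sum(r(w) ** 2)            # sum-squared-error measure

def fd_hessian(f, w, eps=1e-4):
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4 * eps ** 2)
    return 0.5 * (H + H.T)

def fd_jacobian(w, eps=1e-6):
    r0 = r(w)
    J = np.zeros((len(r0), len(w)))
    for j in range(len(w)):
        wp = w.copy(); wp[j] += eps
        J[:, j] = (r(wp) - r0) / eps
    return J

w = np.array([0.3, -0.7])
H = fd_hessian(E, w)
J = fd_jacobian(w)
S = sum(r(w)[i] * fd_hessian(lambda v, i=i: r(v)[i], w) for i in range(2))

print("H = J^T J + S holds:", np.allclose(H, J.T @ J + S, atol=1e-4))
print("eig(J^T J):", np.linalg.eigvalsh(J.T @ J))   # nonnegative by construction
print("eig(H)    :", np.linalg.eigvalsh(H))         # indefinite here: S dominates
```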

Acknowledgments

The authors would like to thank John Dennis (Rice University) and James Demmel (UC Berkeley) for useful discussions.

References (31)

  • G.H. Golub et al., Separable nonlinear least squares: The variable projection method and its applications, Inverse Problems (2003)
  • P.J.G. Lisboa et al., Complete solution of the local minima in the XOR problem, Network (1991)
  • T. Masters, Advanced algorithms for neural networks: A C++ sourcebook (1995)
  • Mizutani, E. (1999). Powell’s dogleg trust-region steps with the quasi-Newton augmented Hessian for neural nonlinear...
  • Mizutani, E. (2005). On computing the Gauss-Newton Hessian matrix for neural-network learning. In Proceedings of the...

An abbreviated version of some portions of this article appeared in Mizutani and Dreyfus (2007) as part of the IJCNN 2007 Conference Proceedings, published under IEEE copyright.
