Neural Networks

Volume 18, Issues 5–6, July–August 2005, Pages 602-610
2005 Special Issue
Framewise phoneme classification with bidirectional LSTM and other neural network architectures

https://doi.org/10.1016/j.neunet.2005.06.042

Abstract

In this paper, we present bidirectional Long Short Term Memory (LSTM) networks, and a modified, full-gradient version of the LSTM learning algorithm. We evaluate bidirectional LSTM (BLSTM) and several other network architectures on the benchmark task of framewise phoneme classification, using the TIMIT database. Our main findings are that bidirectional networks outperform unidirectional ones, and that LSTM is much faster and also more accurate than both standard recurrent neural networks (RNNs) and time-windowed multilayer perceptrons (MLPs). Our results support the view that contextual information is crucial to speech processing, and suggest that BLSTM is an effective architecture with which to exploit it.¹

Introduction

For neural networks, there are two main ways of incorporating context into sequence processing tasks: collect the inputs into overlapping time-windows, and treat the task as spatial; or use recurrent connections to model the flow of time directly. Using time-windows has two major drawbacks: firstly, the optimal window size is task dependent (too small and the net will neglect important information, too large and it will overfit on the training data), and secondly, the network is unable to adapt to shifted or time-warped sequences. However, standard RNNs (by which we mean RNNs containing hidden layers of recurrently connected neurons) have limitations of their own. Firstly, since they process inputs in temporal order, their outputs tend to be mostly based on previous context (there are ways to introduce future context, such as adding a delay between the outputs and the targets; but these do not usually make full use of backwards dependencies). Secondly, they are known to have difficulty learning time-dependencies more than a few timesteps long (Hochreiter et al., 2001). An elegant solution to the first problem is provided by bidirectional networks (Section 2). For the second problem, an alternative RNN architecture, LSTM, has been shown to be capable of learning long time-dependencies (Section 3).
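To make the time-window approach concrete, here is a minimal sketch (not code from the paper; the function and argument names are ours) of collecting a sequence of input frames into overlapping, zero-padded windows for a windowed MLP. The window size is the task-dependent hyperparameter discussed above.

```python
def time_windows(frames, size):
    """Collect a sequence of frames into overlapping, centred windows.

    Output t concatenates the frames from t - size//2 to t + size//2,
    padding with zero-frames at the sequence boundaries. Each window
    is then treated as a single spatial input to an MLP.
    """
    half = size // 2
    dim = len(frames[0])
    pad = [[0.0] * dim]
    padded = pad * half + list(frames) + pad * half
    # sum(..., []) concatenates the `size` frames in each window
    return [sum(padded[t:t + size], []) for t in range(len(frames))]
```

A window that is too small drops context the classifier needs; one that is too large multiplies the input dimension and invites overfitting, which is the trade-off the paragraph above describes.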

Our experiments concentrate on framewise phoneme classification (i.e. mapping a sequence of speech frames to a sequence of phoneme labels associated with those frames). This task is both a first step towards full speech recognition (Robinson, 1994, Bourlard and Morgan, 1994), and a challenging benchmark in sequence processing. In particular, it requires the effective use of contextual information.
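The framewise task has an equally simple error measure: the fraction of frames assigned the wrong phoneme label. A sketch (our own helper, not from the paper):

```python
def framewise_error(predicted, target):
    """Framewise classification error: the fraction of frames whose
    predicted phoneme label differs from the target label."""
    assert len(predicted) == len(target)
    wrong = sum(1 for p, t in zip(predicted, target) if p != t)
    return wrong / len(predicted)
```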

The contents of the rest of this paper are as follows: in Section 2 we discuss bidirectional networks, and answer a possible objection to their use in causal tasks; in Section 3 we describe the Long Short Term Memory (LSTM) network architecture, and our modification to its error gradient calculation; in Section 4 we describe the experimental data and how we used it in our experiments; in Section 5 we give an overview of the various network architectures; in Section 6 we describe how we trained (and retrained) them; in Section 7 we present and discuss the experimental results, and in Section 8 we make concluding remarks. Appendix A contains the pseudocode for training LSTM networks with a full gradient calculation, and Appendix B is an outline of bidirectional training with RNNs.

Bidirectional recurrent neural nets

The basic idea of bidirectional recurrent neural nets (BRNNs) (Schuster and Paliwal, 1997, Baldi et al., 1999) is to present each training sequence forwards and backwards to two separate recurrent nets, both of which are connected to the same output layer. (In some cases a third network is used in place of the output layer, but here we have used the simpler model). This means that for every point in a given sequence, the BRNN has complete, sequential information about all points before and after it.
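The two-net arrangement can be sketched as follows (a schematic of the idea, with our own function names; the recurrent step functions and the output-layer combination are left abstract):

```python
def run_bidirectional(inputs, step_fwd, step_bwd, combine):
    """Sketch of a bidirectional pass: one recurrent net reads the
    sequence forwards, another reads it backwards, and at each
    timestep the shared output layer sees both hidden states."""
    h, states_fwd = None, []
    for x in inputs:                    # forward net, left to right
        h = step_fwd(x, h)
        states_fwd.append(h)
    h, states_bwd = None, []
    for x in reversed(inputs):          # backward net, right to left
        h = step_bwd(x, h)
        states_bwd.append(h)
    states_bwd.reverse()                # re-align with forward time
    # The output at t depends on the whole sequence: the past via the
    # forward net, the future via the backward net.
    return [combine(f, b) for f, b in zip(states_fwd, states_bwd)]
```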

LSTM

The Long Short Term Memory architecture (Hochreiter and Schmidhuber, 1997, Gers et al., 2002) was motivated by an analysis of error flow in existing RNNs (Hochreiter et al., 2001), which found that long time lags were inaccessible to existing architectures, because backpropagated error either blows up or decays exponentially.

An LSTM layer consists of a set of recurrently connected blocks, known as memory blocks. These blocks can be thought of as a differentiable version of the memory chips in a digital computer.
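As an illustration of what one such block computes, here is a single-cell, scalar LSTM step in the now-standard gated form (a sketch under our own naming; it is not the paper's notation or code). The multiplicative gates protect the cell's self-connected state, which is what lets error flow back over long time lags without blowing up or decaying:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell_step(x, h_prev, c_prev, w):
    """One timestep of a single LSTM memory cell (scalars for clarity).

    w maps each of "input", "forget", "output", "cell" to an
    (input weight, recurrent weight, bias) triple.
    """
    def gate(name):
        wi, wh, b = w[name]
        return sigmoid(wi * x + wh * h_prev + b)
    i, f, o = gate("input"), gate("forget"), gate("output")
    g = math.tanh(w["cell"][0] * x + w["cell"][1] * h_prev + w["cell"][2])
    c = f * c_prev + i * g      # gated, self-connected cell state
    h = o * math.tanh(c)        # gated exposure of the state
    return h, c
```

The input, output and forget gates act as continuous analogues of the write, read and reset operations of a memory chip.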

Experimental data

The data for our experiments came from the TIMIT corpus (Garofolo et al., 1993) of prompted utterances, collected by Texas Instruments. The utterances were chosen to be phonetically rich, and the speakers represent a wide variety of American dialects. The audio data is divided into sentences, each of which is accompanied by a complete phonetic transcript.

We preprocessed the audio data into 12 Mel-Frequency Cepstral Coefficients (MFCCs) from 26 filter-bank channels. We also extracted the

Network architectures

We used the following five neural network architectures in our experiments (henceforth referred to by the abbreviations in brackets):

  • Bidirectional LSTM, with two hidden LSTM layers (forwards and backwards), each containing 93 memory blocks of one cell each (BLSTM)

  • Unidirectional LSTM, with one hidden LSTM layer containing 140 one-cell memory blocks, trained backwards with no target delay, and forwards with delays from 0 to 10 frames (LSTM)

  • Bidirectional RNN with two hidden layers

Network training

For all architectures, we calculated the full error gradient using online BPTT (BPTT truncated to the lengths of the utterances), and trained the weights using gradient descent with momentum. We kept the same training parameters for all experiments: initial weights randomised in the range [−0.1, 0.1], a learning rate of 10⁻⁵ and a momentum of 0.9. At the end of each utterance, weight updates were carried out and network activations were reset to 0.
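The per-utterance weight update can be written out as a short sketch (our own helper, using the hyperparameters stated above; it assumes the full BPTT gradient for the utterance has already been accumulated):

```python
def momentum_update(weights, grads, velocity, lr=1e-5, momentum=0.9):
    """Gradient descent with momentum, applied once per utterance.

    Each velocity term accumulates a decaying sum of past gradients,
    smoothing the descent direction across updates.
    """
    new_w, new_v = [], []
    for w, g, v in zip(weights, grads, velocity):
        v = momentum * v - lr * g   # decayed history minus current step
        new_v.append(v)
        new_w.append(w + v)
    return new_w, new_v
```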

Keeping the training algorithm and parameters

Results

Table 1 contains the outcomes of seven randomly initialised training runs with BLSTM. For the rest of the paper, we use their mean as the result for BLSTM. The standard deviation in the test set scores (0.2%) gives an indication of what constitutes a significant difference in network performance.

The last three entries in Table 2 come from the papers indicated (note that Robinson did not quote framewise classification scores; the result for his network was recorded by Schuster, using the original software). The rest

Conclusion and future work

In this paper we have compared bidirectional LSTM to other neural network architectures on the task of framewise phoneme classification. We have found that bidirectional networks are significantly more effective than unidirectional ones, and that LSTM is much faster to train than standard RNNs and MLPs, and also slightly more accurate. We conclude that bidirectional LSTM is an architecture well suited to this and other speech processing tasks, where context is vitally important.

In the future we

Acknowledgements

The authors would like to thank Nicole Beringer for her expert advice on linguistics and speech recognition. This work was supported by the SNF under grant number 200020100249.

References (24)

  • J. Chen et al., Capturing long-term dependencies for protein secondary structure prediction
  • P. Baldi et al., Exploiting the past and the future in protein secondary structure prediction, Bioinformatics (1999)
  • P. Baldi et al., Bidirectional dynamics for protein secondary structure prediction, Lecture Notes in Computer Science (2001)
  • N. Beringer, Human language acquisition in a machine learning task (2004)
  • N. Beringer, Human language acquisition methods in a machine learning task (2004)
  • C. Bishop, Neural networks for pattern recognition (1995)
  • H. Bourlard et al., Connectionist speech recognition: A hybrid approach (1994)
  • R. Chen et al., Experiments on the implementation of recurrent neural networks for speech phone recognition (1996)
  • D. Eck, A. Graves, & J. Schmidhuber (2003), A new approach to continuous speech recognition using LSTM recurrent...
  • T. Fukada et al., Phoneme boundary estimation using bidirectional recurrent neural networks and its applications, Systems and Computers in Japan (1999)
  • J.S. Garofolo et al., DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM (1993)
  • F. Gers et al., Learning precise timing with LSTM recurrent networks, Journal of Machine Learning Research (2002)
¹ An abbreviated version of some portions of this article appeared in Graves and Schmidhuber (2005), as part of the IJCNN 2005 conference proceedings, published under the IEEE copyright.
