2005 Special Issue
Framewise phoneme classification with bidirectional LSTM and other neural network architectures
Introduction
For neural networks, there are two main ways of incorporating context into sequence processing tasks: collect the inputs into overlapping time-windows, and treat the task as spatial; or use recurrent connections to model the flow of time directly. Using time-windows has two major drawbacks: firstly the optimal window size is task dependent (too small and the net will neglect important information, too large and it will overfit on the training data), and secondly the network is unable to adapt to shifted or timewarped sequences. However, standard RNNs (by which we mean RNNs containing hidden layers of recurrently connected neurons) have limitations of their own. Firstly, since they process inputs in temporal order, their outputs tend to be mostly based on previous context (there are ways to introduce future context, such as adding a delay between the outputs and the targets; but these do not usually make full use of backwards dependencies). Secondly they are known to have difficulty learning time-dependencies more than a few timesteps long (Hochreiter et al., 2001). An elegant solution to the first problem is provided by bidirectional networks (Section 2). For the second problem, an alternative RNN architecture, LSTM, has been shown to be capable of learning long time-dependencies (Section 3).
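To make the time-window framing concrete, the sketch below (an illustration, not code from the paper) collects each feature frame together with its neighbours into one flat input vector, zero-padding at the sequence boundaries. The fixed `width` parameter is exactly the task-dependent choice criticised above.

```python
import numpy as np

def make_windows(frames: np.ndarray, width: int) -> np.ndarray:
    """Collect each frame and its neighbours into one flat input vector.

    frames: (T, d) sequence of T feature frames of dimension d.
    width:  odd window size, centred on each frame; the sequence is
            zero-padded at both ends so every frame gets a full window.
    Returns an array of shape (T, width * d).
    """
    half = width // 2
    padded = np.pad(frames, ((half, half), (0, 0)))  # zeros beyond the ends
    return np.stack([padded[t:t + width].ravel() for t in range(len(frames))])

# Five 2-dimensional frames, window of three frames -> rows of 6 values.
X = make_windows(np.arange(10, dtype=float).reshape(5, 2), width=3)
```

A window that is too narrow for a given task discards relevant context; a wider one multiplies the input dimension (`width * d`) and invites overfitting.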
Our experiments concentrate on framewise phoneme classification (i.e. mapping a sequence of speech frames to a sequence of phoneme labels associated with those frames). This task is both a first step towards full speech recognition (Robinson, 1994, Bourlard and Morgan, 1994), and a challenging benchmark in sequence processing. In particular, it requires the effective use of contextual information.
The contents of the rest of this paper are as follows: in Section 2 we discuss bidirectional networks, and answer a possible objection to their use in causal tasks; in Section 3 we describe the Long Short Term Memory (LSTM) network architecture, and our modification to its error gradient calculation; in Section 4 we describe the experimental data and how we used it in our experiments; in Section 5 we give an overview of the various network architectures; in Section 6 we describe how we trained (and retrained) them; in Section 7 we present and discuss the experimental results, and in Section 8 we make concluding remarks. Appendix A contains the pseudocode for training LSTM networks with a full gradient calculation, and Appendix B is an outline of bidirectional training with RNNs.
Bidirectional recurrent neural nets
The basic idea of bidirectional recurrent neural nets (BRNNs) (Schuster and Paliwal, 1997, Baldi et al., 1999) is to present each training sequence forwards and backwards to two separate recurrent nets, both of which are connected to the same output layer. (In some cases a third network is used in place of the output layer, but here we have used the simpler model.) This means that for every point in a given sequence, the BRNN has complete, sequential information about all points before and after it.
LSTM
The Long Short Term Memory architecture (Hochreiter and Schmidhuber, 1997, Gers et al., 2002) was motivated by an analysis of error flow in existing RNNs (Hochreiter et al., 2001), which found that long time lags were inaccessible to existing architectures, because backpropagated error either blows up or decays exponentially.
An LSTM layer consists of a set of recurrently connected blocks, known as memory blocks. These blocks can be thought of as a differentiable version of the memory chips in a digital computer.
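One timestep of a single-cell memory block can be sketched as below. This is an illustrative simplification: it uses the standard gated update with input, forget and output gates (as in Gers et al., 2002), but omits the peephole connections used in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_block_step(x, h_prev, c_prev, W):
    """One timestep of a single-cell LSTM memory block (no peepholes).

    The input, forget and output gates decide what enters, what is kept
    in, and what is read out of the internal cell state c, whose linear
    self-connection lets error flow back over long time lags.
    W maps the concatenated [x, h_prev] to the four block components.
    """
    z = np.concatenate([x, h_prev]) @ W          # shape (4 * n_cells,)
    i, f, g, o = np.split(z, 4)                  # gate and input activations
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # gated cell update
    h = sigmoid(o) * np.tanh(c)                        # gated cell output
    return h, c

rng = np.random.default_rng(0)
n_in, n_cells = 2, 3
W = rng.normal(size=(n_in + n_cells, 4 * n_cells)) * 0.1
h, c = lstm_block_step(rng.normal(size=n_in), np.zeros(n_cells), np.zeros(n_cells), W)
```

When the forget gate saturates near 1 and the input gate near 0, the cell state is carried forward essentially unchanged, which is what preserves information (and gradient) across long time lags.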
Experimental data
The data for our experiments came from the TIMIT corpus (Garofolo et al., 1993) of prompted utterances, collected by Texas Instruments. The utterances were chosen to be phonetically rich, and the speakers represent a wide variety of American dialects. The audio data is divided into sentences, each of which is accompanied by a complete phonetic transcript.
We preprocessed the audio data into 12 Mel-Frequency Cepstrum Coefficients (MFCCs) from 26 filter-bank channels. We also extracted the
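The last step of that preprocessing, mapping one frame of log filter-bank energies to cepstral coefficients, can be sketched as a DCT-II, dropping coefficient 0 (the overall level) to keep 12 MFCCs per frame. This is a generic illustration of the standard MFCC computation, not the paper's exact pipeline.

```python
import numpy as np

def mfcc_from_fbank(log_energies: np.ndarray, n_ceps: int = 12) -> np.ndarray:
    """One frame of log filter-bank energies -> n_ceps cepstral coefficients
    via a DCT-II, with the 0th (DC) coefficient dropped."""
    n = len(log_energies)                      # e.g. 26 mel filter-bank channels
    k = np.arange(n_ceps + 1)[:, None]         # cepstral index 0..n_ceps
    m = np.arange(n)[None, :]                  # filter-bank channel index
    basis = np.cos(np.pi * k * (m + 0.5) / n)  # DCT-II basis functions
    return (basis @ log_energies)[1:]          # keep coefficients 1..n_ceps

# Hypothetical frame of 26 log filter-bank energies.
frame = np.log(np.abs(np.random.default_rng(0).normal(size=26)) + 1.0)
coeffs = mfcc_from_fbank(frame)  # 12 coefficients for this frame
```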
Network architectures
We used the following five neural network architectures in our experiments (henceforth referred to by the abbreviations in brackets):
- Bidirectional LSTM, with two hidden LSTM layers (one forwards, one backwards), both containing 93 memory blocks of one cell each (BLSTM)
- Unidirectional LSTM, with one hidden LSTM layer containing 140 one-cell memory blocks, trained backwards with no target delay, and forwards with delays from 0 to 10 frames (LSTM)
- Bidirectional RNN with two hidden layers
Network training
For all architectures, we calculated the full error gradient using online BPTT (BPTT truncated to the lengths of the utterances), and trained the weights using gradient descent with momentum. We kept the same training parameters for all experiments: initial weights randomised in the range [−0.1, 0.1], a learning rate of 10⁻⁵, and a momentum of 0.9. At the end of each utterance, weight updates were carried out and network activations were reset to 0.
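The weight update used above is plain gradient descent with momentum; a minimal sketch, with the paper's learning rate and momentum as defaults:

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=1e-5, mu=0.9):
    """One gradient-descent-with-momentum update, with the paper's
    learning rate (1e-5) and momentum (0.9) as defaults."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

rng = np.random.default_rng(0)
w = rng.uniform(-0.1, 0.1, size=4)  # initial weights randomised in [-0.1, 0.1]
v = np.zeros_like(w)
# One update at the end of an utterance, given the accumulated gradient.
w, v = momentum_step(w, grad=rng.normal(size=4), velocity=v)
```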
Keeping the training algorithm and parameters
Results
Table 1 contains the outcomes of seven randomly initialised training runs with BLSTM. For the rest of the paper, we use their mean as the result for BLSTM. The standard deviation in the test set scores (0.2%) gives an indication of the significance of differences in network performance.
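The aggregation is just the sample mean and standard deviation over the runs; for example (with made-up scores, not the values in Table 1):

```python
import numpy as np

# Hypothetical test-set scores (%) from 7 randomly initialised runs.
scores = np.array([69.8, 70.1, 69.9, 70.2, 70.0, 69.7, 70.1])
mean = scores.mean()          # reported as the BLSTM result
std = scores.std(ddof=1)      # spread across random initialisations
```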
The last three entries in Table 2 come from the papers indicated (note that Robinson did not quote framewise classification scores; the result for his network was recorded by Schuster, using the original software). The rest
Conclusion and future work
In this paper we have compared bidirectional LSTM to other neural network architectures on the task of framewise phoneme classification. We have found that bidirectional networks are significantly more effective than unidirectional ones, and that LSTM is much faster to train than standard RNNs and MLPs, and also slightly more accurate. We conclude that bidirectional LSTM is an architecture well suited to this and other speech processing tasks, where context is vitally important.
In the future we
Acknowledgements
The authors would like to thank Nicole Beringer for her expert advice on linguistics and speech recognition. This work was supported by the SNF under grant number 200020100249.
References (24)

- et al. Capturing long-term dependencies for protein secondary structure prediction.
- Baldi et al. (1999). Exploiting the past and the future in protein secondary structure prediction. Bioinformatics.
- et al. (2001). Bidirectional dynamics for protein secondary structure prediction. Lecture Notes in Computer Science.
- (2004). Human language acquisition in a machine learning task.
- (2004). Human language acquisition methods in a machine learning task.
- (1995). Neural networks for pattern recognition.
- Bourlard and Morgan (1994). Connectionist speech recognition: A hybrid approach.
- et al. (1996). Experiments on the implementation of recurrent neural networks for speech phone recognition.
- Eck, D., Graves, A., & Schmidhuber, J. (2003). A new approach to continuous speech recognition using LSTM recurrent...
- et al. (1999). Phoneme boundary estimation using bidirectional recurrent neural networks and its applications. Systems and Computers in Japan.
- Garofolo et al. (1993). DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM.
- Gers et al. (2002). Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research.
1. An abbreviated version of some portions of this article appeared in Graves and Schmidhuber (2005), as part of the IJCNN 2005 conference proceedings, published under the IEEE copyright.