Recurrent neural networks


Feedback neural networks structured to have memory and a notion of “current” and “past” states, which can encode time (or any other ordered index).

As someone who does a lot of signal processing for music, I find the notion that these generalise linear systems theory suggestive of lots of interesting DSP applications.

The connection between these (IIR) and “convolutional” (FIR) neural networks is suggestive for the same reason.
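To make the IIR/FIR analogy concrete, here is a minimal sketch in NumPy. It strips out everything neural (no nonlinearity, one unit) so only the filter structure remains; the function names and coefficient values are my own illustrative choices, not taken from any particular library or paper.

```python
import numpy as np

def fir_filter(x, taps):
    """Feed-forward ("convolutional") filter: y[t] = sum_k taps[k] * x[t - k]."""
    return np.convolve(x, taps)[: len(x)]

def linear_rnn(x, a, b):
    """One-unit linear recurrence: h[t] = a * h[t - 1] + b * x[t]."""
    h = 0.0
    out = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = a * h + b * x_t  # the state feeds back into itself
        out[t] = h
    return out

# Probe both systems with a unit impulse.
impulse = np.zeros(8)
impulse[0] = 1.0

fir_response = fir_filter(impulse, taps=np.array([0.5, 0.25, 0.125]))
iir_response = linear_rnn(impulse, a=0.5, b=1.0)
```

The FIR response is exactly zero once the three taps are exhausted, while the recurrent response decays geometrically and never quite reaches zero. That feedback is what gives the recurrent version its long memory, and also what makes its stability depend on the size of `a`.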



The main problem here is that they are unstable in the training phase unless you are clever.
See BeSF94. One solution is LSTM; see next.
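A rough numerical illustration of that instability, under a simplifying assumption: I use a purely linear recurrence h[t] = W h[t-1], so that backpropagating through one timestep is just multiplication by the transpose of W. A real RNN also multiplies by the activation derivatives at each step. The helper names, matrix size and spectral radii below are arbitrary choices for the sketch.

```python
import numpy as np

def random_recurrent_matrix(n, spectral_radius, rng):
    """Random square matrix rescaled so its largest |eigenvalue| is spectral_radius."""
    W = rng.standard_normal((n, n))
    return W * (spectral_radius / np.max(np.abs(np.linalg.eigvals(W))))

def backprop_norms(W, steps):
    """Norm of a gradient vector after each backward step through h[t] = W @ h[t-1]."""
    g = np.ones(W.shape[0]) / np.sqrt(W.shape[0])  # unit-norm starting gradient
    norms = []
    for _ in range(steps):
        g = W.T @ g  # one step of backpropagation through time
        norms.append(np.linalg.norm(g))
    return np.array(norms)

rng = np.random.default_rng(0)
shrinking = backprop_norms(random_recurrent_matrix(20, 0.9, rng), steps=200)
growing = backprop_norms(random_recurrent_matrix(20, 1.1, rng), steps=200)
```

With the spectral radius slightly below one the gradient norm collapses towards zero over 200 steps; slightly above one, it blows up. Either way, long-range dependencies are hard to learn by plain gradient descent.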

Long Short Term Memory (LSTM)

As always, Christopher Olah wins the visual explanation prize:
Understanding LSTM Networks
LSTM Networks for Sentiment Analysis:

In a traditional recurrent neural network, during the gradient back-propagation phase, the gradient signal can end up being multiplied a large number of times (as many as the number of timesteps) by the weight matrix associated with the connections between the neurons of the recurrent hidden layer. This means that, the magnitude of weights in the transition matrix can have a strong impact on the learning process.[…]

These issues are the main motivation behind the LSTM model which introduces a new structure called a memory cell […]. A memory cell is composed of four main elements: an input gate, a neuron with a self-recurrent connection (a connection to itself), a forget gate and an output gate. […] The gates serve to modulate the interactions between the memory cell itself and its environment.
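My own sketch of the four elements named in that passage, as a single LSTM step in NumPy. This is one common textbook formulation, not necessarily the exact variant used in the quoted tutorial; the stacked parameter layout (`W`, `U`, `b` holding all four blocks) and the toy dimensions are assumptions made for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step.

    W: (4n, d), U: (4n, n), b: (4n,), with rows stacked in the order
    [input gate, forget gate, output gate, candidate].
    """
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:n])            # input gate: how much new content to write
    f = sigmoid(z[n:2 * n])        # forget gate: how much old cell state to keep
    o = sigmoid(z[2 * n:3 * n])    # output gate: how much of the cell to expose
    g = np.tanh(z[3 * n:4 * n])    # candidate content
    c = f * c_prev + i * g         # self-recurrent memory cell
    h = o * np.tanh(c)             # hidden state seen by the rest of the network
    return h, c

# Run a few steps on random input, with hypothetical sizes d=3, n=5.
rng = np.random.default_rng(0)
d, n = 3, 5
W = 0.1 * rng.standard_normal((4 * n, d))
U = 0.1 * rng.standard_normal((4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
for x in rng.standard_normal((10, d)):
    h, c = lstm_step(x, h, c, W, U, b)
```

The point to notice is the cell update `c = f * c_prev + i * g`: the path from `c_prev` to `c` is an elementwise scaling by the forget gate rather than a full matrix multiplication, which is what eases the repeated-multiplication problem described in the quote.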


A mini-genre.
KaDG15 connect recurrent cells across multiple axes, leading to a higher-rank MIMO system.
This is natural in many kinds of spatial random fields, and I am amazed it was uncommon enough to need formalizing in a paper; but it was and it did, and good on Kalchbrenner et al.

Gated Recurrent Unit (GRU)
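For orientation, a minimal sketch of one GRU step in NumPy. Sign conventions for the update gate differ between papers (some swap the roles of `z` and `1 - z`), biases are omitted, and the parameter names and toy dimensions are my own, so treat this as one plausible formulation rather than the definitive one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step (biases omitted for brevity)."""
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate state
    return z * h_prev + (1.0 - z) * h_tilde        # blend old state and candidate

# Hypothetical sizes: 3 inputs, 4 hidden units.
rng = np.random.default_rng(0)
d, n = 3, 4
Wz, Wr, Wh = (0.1 * rng.standard_normal((n, d)) for _ in range(3))
Uz, Ur, Uh = (0.1 * rng.standard_normal((n, n)) for _ in range(3))
h = np.zeros(n)
for x in rng.standard_normal((10, d)):
    h = gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh)
```

Compared with an LSTM there is no separate memory cell and no output gate; the hidden state itself is the memory, interpolated between its previous value and a candidate.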


Liquid/Echo State Machines

This sounds deliciously lazy.
Very roughly speaking, your first layer is a reservoir of random saturating IIR filters.
You fit a classifier on the outputs of this.
Easy to implement, that.
I wonder when it actually works, what the constraints on topology are, etc.
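To back up the "easy to implement" claim, a minimal echo-state-style sketch in NumPy. Assumptions, all mine: a dense Gaussian reservoir rescaled to spectral radius 0.9, a tanh saturation, and a ridge-regression readout doing one-step-ahead prediction of a sine wave (a regression stand-in for the classifier mentioned above). None of the names or settings come from a specific paper or library.

```python
import numpy as np

def make_reservoir(n_in, n_res, spectral_radius, rng):
    """Random, untrained input and recurrent weights."""
    W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
    W = rng.standard_normal((n_res, n_res))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    return W_in, W

def run_reservoir(inputs, W_in, W):
    """Drive the reservoir and record its state at every timestep."""
    h = np.zeros(W.shape[0])
    states = np.empty((len(inputs), W.shape[0]))
    for t, u in enumerate(inputs):
        h = np.tanh(W_in @ u + W @ h)  # saturating recurrent filter bank
        states[t] = h
    return states

def fit_readout(states, targets, ridge=1e-6):
    """Ridge regression from reservoir states to targets; the only trained part."""
    A = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(A, states.T @ targets)

rng = np.random.default_rng(0)
signal = np.sin(0.2 * np.arange(600))
inputs, targets = signal[:-1, None], signal[1:]  # predict the next sample

W_in, W = make_reservoir(n_in=1, n_res=50, spectral_radius=0.9, rng=rng)
states = run_reservoir(inputs, W_in, W)

washout = 100  # discard the initial transient before fitting
w_out = fit_readout(states[washout:], targets[washout:])
prediction = states[washout:] @ w_out
mse = np.mean((prediction - targets[washout:]) ** 2)
```

Only `w_out` is learned, by a single linear solve; the recurrent weights stay as drawn. Note the error here is measured in-sample on a trivially easy signal, so it says nothing about when the approach works in general.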

I wonder if you can use some kind of sparsifying transform on the recurrence operator?

These claim to be based on spiking neuron models, but AFAICT this is not at all necessary.

Various claims are made about how they avoid the training difficulty of comparably basic RNNs by being essentially untrained; you use them as a feature factory for another supervised output algorithm.

Suggestive parallel with random projections.


From a dynamical systems perspective, there are two main classes of RNNs.
Models from the first class are characterized by an energy-minimizing
stochastic dynamics and symmetric connections.
The best known instantiations are Hopfield networks, Boltzmann machines, and
the recently emerging Deep Belief Networks.
These networks are mostly trained in some unsupervised learning scheme.
Typical targeted network functionalities in this field are associative
memories, data compression, the unsupervised modeling of data distributions,
and static pattern classification, where the model is run for multiple time
steps per single input instance to reach some type of convergence or
equilibrium (but see e.g., TaHR06 for extension to temporal data).
The mathematical background is rooted in statistical physics.
In contrast, the second big class of RNN models typically features a
deterministic update dynamics and directed connections.
Systems from this class implement nonlinear filters, which
transform an input time series into an output time series.
The mathematical background here is nonlinear dynamical systems.
The standard training mode is supervised.
This survey is concerned only with RNNs of this second type, and
when we speak of RNNs later on, we will exclusively refer to such systems.


It’s still the wild west. Invent a category, name it and stake a claim.


Variable sequence length:

Danijar Hafner:
* Introduction to Recurrent Networks in TensorFlow

seq2seq models with GRUs: Fun with Recurrent Neural Nets: One More Dive into CNTK and TensorFlow


Auer, P., Burgsteiner, H., & Maass, W. (2008) A learning rule for very simple universal approximators consisting of a single layer of perceptrons. Neural Networks, 21(5), 786–795. DOI.
Bengio, Y., Simard, P., & Frasconi, P. (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166. DOI.
Boulanger-Lewandowski, N., Bengio, Y., & Vincent, P. (2012) Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription. In 29th International Conference on Machine Learning.
Bown, O., & Lexer, S. (2006) Continuous-Time Recurrent Neural Networks for Generative and Interactive Musical Performance. In F. Rothlauf, J. Branke, S. Cagnoni, E. Costa, C. Cotta, R. Drechsler, … H. Takagi (Eds.), Applications of Evolutionary Computing (pp. 652–663). Springer Berlin Heidelberg.
Buhusi, C. V., & Meck, W. H. (2005) What makes us tick? Functional and neural mechanisms of interval timing. Nature Reviews Neuroscience, 6(10), 755–765. DOI.
Cho, K., van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014) On the properties of neural machine translation: Encoder-decoder approaches. arXiv Preprint arXiv:1409.1259.
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014) Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. In NIPS.
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2015) Gated Feedback Recurrent Neural Networks. arXiv:1502.02367 [Cs, Stat].
Doelling, K. B., & Poeppel, D. (2015) Cortical entrainment to music and its modulation by expertise. Proceedings of the National Academy of Sciences, 112(45), E6233–E6242. DOI.
Duan, Q., Park, J. H., & Wu, Z.-G. (2014) Exponential state estimator design for discrete-time neural networks with discrete and distributed time-varying delays. Complexity, 20(1), 38–48. DOI.
Gal, Y. (2015) A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. arXiv:1512.05287 [Stat].
Gers, F. A., Schmidhuber, J., & Cummins, F. (2000) Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10), 2451–2471. DOI.
Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., & Wierstra, D. (2015) DRAW: A Recurrent Neural Network For Image Generation. arXiv:1502.04623 [Cs].
Grzyb, B. J., Chinellato, E., Wojcik, G. M., & Kaminski, W. A. (2009) Which model to use for the Liquid State Machine? In 2009 International Joint Conference on Neural Networks (pp. 1018–1024). DOI.
Hazan, H., & Manevitz, L. M. (2012) Topological constraints and robustness in liquid state machines. Expert Systems with Applications, 39(2), 1597–1606. DOI.
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., … Kingsbury, B. (2012) Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6), 82–97. DOI.
Hochreiter, S., & Schmidhuber, J. (1997) Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. DOI.
Jozefowicz, R., Zaremba, W., & Sutskever, I. (2015) An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15) (pp. 2342–2350).
Kalchbrenner, N., Danihelka, I., & Graves, A. (2015) Grid Long Short-Term Memory. arXiv:1507.01526 [Cs].
Karpathy, A., Johnson, J., & Fei-Fei, L. (2015) Visualizing and Understanding Recurrent Networks. arXiv:1506.02078 [Cs].
LeCun, Y. (1998) Gradient-based learning applied to document recognition. Proc. IEEE, 86(11), 2278–2324. DOI.
Legenstein, R., Naeger, C., & Maass, W. (2005) What Can a Neuron Learn with Spike-Timing-Dependent Plasticity? Neural Computation, 17(11), 2337–2382. DOI.
Lipton, Z. C., Berkowitz, J., & Elkan, C. (2015) A Critical Review of Recurrent Neural Networks for Sequence Learning. arXiv:1506.00019 [Cs].
Lukoševičius, M., & Jaeger, H. (2009) Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3), 127–149. DOI.
Maass, W., Natschläger, T., & Markram, H. (2004) Computational Models for Generic Cortical Microcircuits. In Computational Neuroscience: A Comprehensive Approach (pp. 575–605). Chapman & Hall/CRC.
Miconi, T. (2015) Training recurrent neural networks with sparse, delayed rewards for flexible decision tasks. arXiv:1507.08973 [Q-Bio].
Mnih, V. (2015) Human-level control through deep reinforcement learning. Nature, 518, 529–533. DOI.
Mohamed, A.-r., Dahl, G. E., & Hinton, G. (2012) Acoustic Modeling Using Deep Belief Networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 14–22. DOI.
Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E. H., & Freeman, W. T. (2015) Visually Indicated Sounds. arXiv:1512.08512 [Cs].
Rohrbach, A., Rohrbach, M., & Schiele, B. (2015) The Long-Short Story of Movie Description. arXiv:1506.01698 [Cs].
Schwenk, H. (2007) Continuous space language models. Computer Speech Lang., 21, 492–518. DOI.
Taylor, G. W., Hinton, G. E., & Roweis, S. T. (2006) Modeling human motion using binary latent variables. In Advances in neural information processing systems (pp. 1345–1352).
Theis, L., & Bethge, M. (2015) Generative Image Modeling Using Spatial LSTMs. arXiv:1506.03478 [Cs, Stat].
Visin, F., Kastner, K., Cho, K., Matteucci, M., Courville, A., & Bengio, Y. (2015) ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks. arXiv:1505.00393 [Cs].
Waibel, A. (1989) Phoneme recognition using time-delay neural networks. IEEE Trans. Acoustics Speech Signal Process., 37(3), 328–339. DOI.
Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., & Courville, A. (2015) Describing Videos by Exploiting Temporal Structure. arXiv:1502.08029 [Cs, Stat].

See original: The Living Thing / Notebooks Recurrent neural networks