SLIDE 1

LSTM: A Search Space Odyssey

Presenters: Yijun Tian, Zhenyu Liu

Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber, 2015.

SLIDE 2

Abstract

  • In this paper, the authors analyze the performance of the LSTM and eight of its variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling.
  • Hyperparameters for each variant were optimized individually using random search, and their importance was gauged using fANOVA (a tool for assessing hyperparameter importance).

SLIDE 3

Datasets

  • TIMIT: the TIMIT Speech corpus (speech recognition).
  • IAM Online: the IAM Online Handwriting Database (handwriting recognition).
  • JSB Chorales: a collection of 382 four-part harmonized chorales by J. S. Bach (polyphonic music modeling).

SLIDE 4

Vanilla LSTM

N: number of LSTM blocks per hidden layer; M: input size.
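
The slide's diagram is not reproduced here. For reference, the vanilla LSTM forward pass as defined in the paper (g and h are tanh, sigma is the logistic sigmoid; W_* are the N x M input weight matrices, R_* the N x N recurrent weight matrices, p_* the peephole weight vectors):

    \begin{aligned}
    z^t &= g(W_z x^t + R_z y^{t-1} + b_z)                          && \text{block input} \\
    i^t &= \sigma(W_i x^t + R_i y^{t-1} + p_i \odot c^{t-1} + b_i) && \text{input gate} \\
    f^t &= \sigma(W_f x^t + R_f y^{t-1} + p_f \odot c^{t-1} + b_f) && \text{forget gate} \\
    c^t &= z^t \odot i^t + c^{t-1} \odot f^t                       && \text{cell state} \\
    o^t &= \sigma(W_o x^t + R_o y^{t-1} + p_o \odot c^t + b_o)     && \text{output gate} \\
    y^t &= h(c^t) \odot o^t                                        && \text{block output}
    \end{aligned}

As a concrete illustration, a minimal NumPy sketch of one forward step (not the authors' code; lstm_step and the parameter layout are hypothetical names):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def lstm_step(x, y_prev, c_prev, W, R, p, b):
        """One forward step of a vanilla LSTM block with peepholes.

        x: input (M,); y_prev: previous output (N,); c_prev: previous cell (N,).
        W[k]: (N, M), R[k]: (N, N), p[k]: (N,), b[k]: (N,) for k in "zifo"
        (peepholes p feed only the gates, not the block input z).
        """
        z = np.tanh(W["z"] @ x + R["z"] @ y_prev + b["z"])                    # block input
        i = sigmoid(W["i"] @ x + R["i"] @ y_prev + p["i"] * c_prev + b["i"])  # input gate
        f = sigmoid(W["f"] @ x + R["f"] @ y_prev + p["f"] * c_prev + b["f"])  # forget gate
        c = z * i + c_prev * f                                                # new cell state
        o = sigmoid(W["o"] @ x + R["o"] @ y_prev + p["o"] * c + b["o"])       # output gate
        y = np.tanh(c) * o                                                    # block output
        return y, c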

SLIDE 5

LSTM Variants

The eight variants evaluated in the paper, each changing a single aspect of the vanilla LSTM:

  • NIG: no input gate. NFG: no forget gate. NOG: no output gate.
  • NIAF: no input activation function. NOAF: no output activation function.
  • NP: no peephole connections. FGR: full gate recurrence.
  • CIFG: coupled input and forget gate (f^t = 1 - i^t).

SLIDE 6

Experiments

  • Performed 27 random searches (one for each combination of the nine variants and three datasets).
  • Each random search encompasses 200 trials of randomly sampling the following hyperparameters (a minimal sketch follows below):
  • Number of LSTM blocks per hidden layer.
  • Learning rate, momentum, and standard deviation of Gaussian input noise.
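
A minimal sketch of one such random search (not the authors' code; the sampling ranges and the train_and_eval routine are illustrative assumptions):

    import random

    def sample_trial():
        """Randomly sample one hyperparameter setting."""
        return {
            "hidden_size":   random.randint(20, 200),       # LSTM blocks per hidden layer
            "learning_rate": 10 ** random.uniform(-6, -2),  # sampled log-uniformly
            "momentum":      random.uniform(0.0, 0.99),
            "input_noise":   random.uniform(0.0, 1.0),      # std of Gaussian input noise
        }

    def random_search(train_and_eval, n_trials=200):
        """Run independent trials and keep the best validation result."""
        best = None
        for _ in range(n_trials):
            hp = sample_trial()
            error = train_and_eval(hp)  # trains one LSTM, returns validation error
            if best is None or error < best[0]:
                best = (error, hp)
        return best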

SLIDE 7

Results

NOAF and NFG perform significantly worse than the vanilla LSTM.

SLIDE 8

Results

Learning rate is the most important hyperparameter, followed by network size.
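
These estimates come from fANOVA, which fits a random-forest model of performance as a function of the hyperparameters and decomposes its variance. For intuition only, a simplified sketch of variance-based importance (this is not fANOVA; the function name and binning scheme are illustrative):

    import numpy as np

    def marginal_importance(values, errors, n_bins=10):
        """Fraction of error variance explained by one hyperparameter's
        binned marginal means. values, errors: arrays of length T (trials)."""
        edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
        which = np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)
        bin_means = np.array([errors[which == k].mean()
                              for k in range(n_bins) if np.any(which == k)])
        return bin_means.var() / errors.var()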

SLIDE 9

Conclusions and Insights

  • None of the variants improves upon the standard LSTM architecture significantly.
  • Coupling the input and forget gates (CIFG) or removing peephole connections (NP) are attractive simplifications that do not significantly hurt performance.
  • The forget gate and the output activation function are the most critical components of the LSTM block.
  • Learning rate and network size are the most important hyperparameters.
  • No apparent structure in hyperparameter interactions: they can be tuned virtually independently.

SLIDE 10

Thank you! Questions?

Take-home message: the most commonly used LSTM architecture (vanilla LSTM) performs reasonably well across a variety of datasets.