Long Short-Term Memory Networks

Neural networks, inspired by the human brain, have revolutionized artificial intelligence, enabling machines to learn complex patterns from data. These networks consist of interconnected nodes, akin to neurons, which process information and make predictions. Training a neural network means optimizing its parameters to minimize errors. This is done with backpropagation, which computes the gradient of the error with respect to each parameter, and gradient descent, which uses those gradients to iteratively reduce the difference between predicted and actual outcomes.

At its core, a neural network is a collection of interconnected nodes (thought of as artificial neurons) organized into layers, as shown above. The input layer receives data, which is then processed through hidden layers using weighted connections. These weights are adjusted during training, allowing the network to learn patterns. The output layer produces the final results, such as classifications or predictions.
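To make this concrete, here is a minimal sketch of a forward pass through one hidden layer and an output layer, written in plain NumPy with hypothetical layer sizes (4 inputs, 8 hidden units, 3 outputs):

```python
import numpy as np

# Hypothetical sizes: 4 input features, 8 hidden units, 3 output scores.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)   # input -> hidden weights and biases
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)   # hidden -> output weights and biases

x = rng.normal(size=4)                          # a single input example
hidden = np.tanh(W1 @ x + b1)                   # hidden layer: weighted sum + nonlinearity
outputs = W2 @ hidden + b2                      # output layer: raw prediction scores
```

During training, backpropagation would compute how the error changes with respect to W1, b1, W2, and b2, and gradient descent would adjust them in the direction that reduces that error.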

Recurrent Neural Networks (RNNs) and the Challenge of Long-Term Dependencies

RNNs are a class of neural networks designed to process sequential data, making them ideal for tasks like language modeling and time series prediction. Unlike feedforward neural networks, RNNs have connections that form cycles, allowing them to maintain a hidden state that captures information from past inputs.
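As a sketch of this recurrence (plain NumPy, hypothetical dimensions), note that the same weights are reused at every time step and the hidden state is what carries information forward:

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size, seq_len = 5, 16, 10                   # hypothetical dimensions
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (the cycle)
b_h = np.zeros(hidden_size)

xs = rng.normal(size=(seq_len, input_size))                    # a toy input sequence
h = np.zeros(hidden_size)                                      # initial hidden state
for x_t in xs:
    # Each step mixes the current input with the previous hidden state.
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
```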

However, RNNs face a significant challenge when dealing with long-term dependencies. As gradients are propagated backward through many time steps, they often shrink, leading to what is known as the vanishing gradient problem. This occurs when the gradients used to update network parameters become infinitesimally small, hindering the learning of long-range dependencies.
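The effect is easy to see numerically: backpropagation through time multiplies one scaling factor per step, and if those factors are consistently smaller than one, the product collapses toward zero. A toy illustration (repeated scaling by a hypothetical per-step factor, not an actual backpropagation-through-time computation):

```python
# tanh derivatives are at most 1, and with small recurrent weights each step
# scales the gradient by a factor below 1; over many steps the product vanishes.
per_step_factor = 0.9
for steps in (10, 50, 100):
    print(steps, per_step_factor ** steps)
# 10  -> ~0.35
# 50  -> ~0.0052
# 100 -> ~0.000027
```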

Introducing Long Short-Term Memory (LSTM) Networks

To address the vanishing gradient problem and capture long-term dependencies more effectively, Hochreiter and Schmidhuber introduced Long Short-Term Memory (LSTM) networks in 1997. LSTMs are a type of RNN specifically crafted to store information for extended periods, making them well-suited for tasks where context over long sequences is crucial, such as machine translation and speech recognition.

The main intuition behind LSTMs is to use two separate paths: one for long-term dependencies and one for short-term dependencies. The picture below shows a sequence of LSTM units, where we can spot a green line that represents the Long Term Memory and a purple line that represents the Short Term Memory.

We refer to the Long Term Memory as the “Cell State”, since it is very stable and interacts with every cell in the sequence. The Short Term Memory, on the other hand, is referred to as the “Hidden State”, and as shown in the picture it is “regenerated” at each step.

The Idea Behind LSTMs

At the heart of an LSTM cell are the Cell State, the Hidden State, and three gates. The cell state acts as a conveyor belt, passing information along while being selectively modified by the gates. The gates regulate the flow of information, deciding what to store, what to discard, and what to output to the next layer or as the final prediction.

These gates allow an LSTM to selectively update, add, or remove information from its cell state, which enables it to carry information across many time steps in a sequence. To see in more detail what the gates actually do, let’s break down the unit (a code sketch follows the list below):

  1. Forget Gate: determines what information to discard from the previous cell state. It takes the previous hidden state h_{t-1} and the current input x_t and passes them through a sigmoid activation function. This produces a value between 0 and 1, where 1 means “keep this information” and 0 means “forget this information”. This percentage is then multiplied element-wise with the previous cell state C_{t-1}.
  2. Input Gate: determines what new information to store in the current cell state. It takes the previous hidden state h_{t-1} and the current input x_t and processes them through two separate activations: a sigmoid activation (to generate the percentage of the candidate value to be added to the cell state) and a tanh activation (to generate the candidate value itself). Their element-wise product is added to the cell state to form C_t.
  3. Output Gate: determines what to output based on the current cell state. It takes the previous hidden state h_{t-1} and the current input x_t and passes them through a sigmoid activation to produce the gate value. The updated cell state C_t is passed through a tanh, and the two are multiplied element-wise to produce the current hidden state h_t.
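Putting the three gates together, here is a minimal sketch of a single LSTM forward step in NumPy. The weight names, dimensions, and toy usage are illustrative assumptions rather than any particular library's API; each gate follows the standard form sigmoid(W · [h_{t-1}, x_t] + b):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: returns the new hidden state and cell state."""
    W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])       # gates see [h_{t-1}, x_t]

    f_t = sigmoid(W_f @ z + b_f)            # forget gate: how much of c_{t-1} to keep
    i_t = sigmoid(W_i @ z + b_i)            # input gate: how much candidate to add
    c_hat = np.tanh(W_c @ z + b_c)          # candidate cell value
    c_t = f_t * c_prev + i_t * c_hat        # updated cell state (long-term memory)

    o_t = sigmoid(W_o @ z + b_o)            # output gate
    h_t = o_t * np.tanh(c_t)                # new hidden state (short-term memory)
    return h_t, c_t

# Toy usage with hypothetical sizes: 3 input features, 4 hidden units.
rng = np.random.default_rng(2)
n_in, n_hid = 3, 4
params = []
for _ in range(4):                          # forget, input, candidate, output
    params += [rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)), np.zeros(n_hid)]
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):      # a sequence of 5 time steps
    h, c = lstm_step(x_t, h, c, params)
```

Note how the cell state c_t is only ever rescaled and added to, which is what lets information (and gradients) flow across many time steps, while the hidden state h_t is recomputed at every step.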

Advantages of LSTMs

1. Capturing Long-Term Dependencies: LSTMs excel at capturing dependencies in data sequences that span many time steps, which is invaluable for tasks like speech recognition where understanding phonemes and words often requires context from earlier parts of the audio.

2. Preventing Vanishing Gradients: The architecture of LSTMs enables the network to learn which information to retain or discard, mitigating the vanishing gradient problem and ensuring that relevant information is propagated across time steps.

3. Versatility: LSTMs can be used for various applications, including natural language processing, speech recognition, and even in generating creative content like music or poetry.

Challenges and Considerations

While LSTMs are powerful, they are not without challenges. They can be computationally intensive and may require significant amounts of data to train effectively. Additionally, tuning the architecture and hyperparameters can be complex, requiring a deep understanding of the problem domain.

Conclusion

Long Short-Term Memory networks have significantly enhanced the capabilities of RNNs, allowing them to tackle tasks that involve intricate, long-term dependencies in data sequences. By addressing the vanishing gradient problem and providing a mechanism for the selective retention of information, LSTMs have become a cornerstone in the realm of deep learning, paving the way for advancements in natural language processing, speech recognition, and various other fields. As research continues, these networks are likely to evolve, unlocking new possibilities and applications in the world of artificial intelligence.

Authors: Francesco Vinciguerra, Florian Daefler, Angela Lavinia Varazi
