Mathematical understanding of RNN and its variants


Recurrent neural networks (RNNs) are a class of Deep Learning (DL) models that excel at processing data sequentially. They are widely used in fields such as Natural Language Processing (NLP), machine translation, and many others. This article examines a number of popular RNN variants and dives into the mathematical ideas that underlie them.

Basics of Recurrent Neural Networks

Recurrent neural networks are a type of neural network architecture that processes sequential data by maintaining an internal state, known as the hidden state. The fundamental principle is that an RNN applies the same computation to every element of a sequence while preserving and updating this hidden state. Because the hidden state acts as a memory of the earlier items in the sequence, the network can capture dependencies and temporal patterns in the data.

The Mathematical Formulation of RNNs

Consider a straightforward RNN with a single hidden layer. Given an input sequence of length T, let x(t) denote the input at time step t and h(t) the hidden state at time step t. The RNN updates its hidden state using the following equations.

h(t) = f(W(hh)h(t-1) + W(xh)x(t) + b(h))     [Equation 1]
y(t) = g(W(hy)h(t) + b(y))                   [Equation 2]

In Equation 1, W(hh) is the hidden-to-hidden weight matrix, W(xh) is the input-to-hidden weight matrix, b(h) is the bias vector of the hidden layer, and f is an activation function applied element-wise. Equation 2 gives the output at time step t, where W(hy) is the weight matrix linking the hidden layer to the output layer, b(y) is the bias vector of the output layer, and g is the activation function used to produce the output.
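As a concrete illustration, here is a minimal NumPy sketch of the forward pass described by Equations 1 and 2. The choice of tanh for f, the identity for g, and the toy dimensions are assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a simple RNN over a sequence xs of shape (T, input_dim)."""
    h = np.zeros(W_hh.shape[0])          # initial hidden state h(0)
    outputs = []
    for x_t in xs:
        # Equation 1: h(t) = f(W(hh)h(t-1) + W(xh)x(t) + b(h)), with f = tanh
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        # Equation 2: y(t) = g(W(hy)h(t) + b(y)), with g = identity here
        outputs.append(W_hy @ h + b_y)
    return np.stack(outputs), h

# Toy usage: sequence of length 5, input size 3, hidden size 4, output size 2.
rng = np.random.default_rng(0)
T, d_in, d_h, d_out = 5, 3, 4, 2
xs = rng.normal(size=(T, d_in))
W_xh, W_hh, W_hy = (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)),
                    rng.normal(size=(d_out, d_h)))
ys, h_T = rnn_forward(xs, W_xh, W_hh, W_hy, np.zeros(d_h), np.zeros(d_out))
print(ys.shape)  # (5, 2): one output per time step
```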

Backpropagation Through Time (BPTT)

To train an RNN, we must compute gradients of the loss with respect to the model parameters and update them accordingly. The gradient-computation algorithm used in RNNs is called backpropagation through time (BPTT). BPTT is an adaptation of the standard backpropagation algorithm: the network is unrolled across the time steps of the sequence, and gradients are propagated backward through every step.
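To make the idea concrete, the sketch below performs BPTT by hand for a scalar RNN with a tanh activation and a squared-error loss on the final hidden state; the one-dimensional state and the toy loss are assumptions chosen to keep the example short.

```python
import numpy as np

def bptt_scalar(xs, w_xh, w_hh, target):
    """BPTT for a scalar RNN: h(t) = tanh(w_hh*h(t-1) + w_xh*x(t)),
    loss = 0.5 * (h(T) - target)^2."""
    # Forward pass: store every hidden state, since BPTT needs all of them.
    hs = [0.0]
    for x_t in xs:
        hs.append(np.tanh(w_hh * hs[-1] + w_xh * x_t))

    # Backward pass: walk the unrolled graph from t = T down to t = 1.
    grad_w_hh, grad_w_xh = 0.0, 0.0
    dL_dh = hs[-1] - target                    # gradient at the final hidden state
    for t in range(len(xs), 0, -1):
        dpre = dL_dh * (1.0 - hs[t] ** 2)      # backprop through tanh
        grad_w_hh += dpre * hs[t - 1]          # contribution of step t to dL/dw_hh
        grad_w_xh += dpre * xs[t - 1]          # contribution of step t to dL/dw_xh
        dL_dh = dpre * w_hh                    # pass the gradient back to h(t-1)
    return grad_w_hh, grad_w_xh

# Quick numerical check of the w_hh gradient on a toy sequence.
xs, w_xh, w_hh, target = [0.5, -1.0, 0.3], 0.7, 0.9, 0.2
g_hh, _ = bptt_scalar(xs, w_xh, w_hh, target)

def loss(whh):
    h = 0.0
    for x in xs:
        h = np.tanh(whh * h + w_xh * x)
    return 0.5 * (h - target) ** 2

eps = 1e-6
print(g_hh, (loss(w_hh + eps) - loss(w_hh - eps)) / (2 * eps))  # should match
```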

Exploring RNN Variants

(a) Long Short-Term Memory (LSTM)

The LSTM is an RNN variant that mitigates the vanishing-gradient problem and can capture long-term dependencies. LSTM introduces a memory cell together with three gating mechanisms: the input gate, the forget gate, and the output gate. These gates regulate the flow of information into and out of the memory cell, enabling the network to retain or discard particular pieces of information as needed.
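The sketch below shows one LSTM step in NumPy; the stacked weight layout and the gate ordering are assumptions of this example rather than a fixed convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*hidden, hidden+input); b has shape (4*hidden,).
    Assumed gate layout: input, forget, output, candidate."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.shape[0]
    i = sigmoid(z[0:H])          # input gate: how much new information to write
    f = sigmoid(z[H:2*H])        # forget gate: how much of the old cell state to keep
    o = sigmoid(z[2*H:3*H])      # output gate: how much of the cell state to expose
    g = np.tanh(z[3*H:4*H])      # candidate cell content
    c_t = f * c_prev + i * g     # update the memory cell
    h_t = o * np.tanh(c_t)       # new hidden state
    return h_t, c_t

# Toy usage with hidden size 4 and input size 3.
rng = np.random.default_rng(0)
H, D = 4, 3
W, b = rng.normal(size=(4 * H, H + D)), np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
print(h.shape, c.shape)  # (4,) (4,)
```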

(b) Gated Recurrent Unit (GRU)

This RNN variant achieves comparable results while streamlining the LSTM design. The GRU merges the cell state and the hidden state, and combines the LSTM's forget and input gates into a single update gate. This reduction in complexity lowers the number of parameters and operations required, making the GRU cheaper to compute.
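Here is a minimal sketch of one GRU step, using one common gating convention (interpolating between the previous state and the candidate state); the shapes and parameter layout are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step. Each W_* has shape (hidden, hidden+input)."""
    xh = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ xh + b_z)                                       # update gate
    r = sigmoid(W_r @ xh + b_r)                                       # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)  # candidate state
    return (1 - z) * h_prev + z * h_tilde                             # mix old and new

# Toy usage with hidden size 4 and input size 3.
rng = np.random.default_rng(0)
H, D = 4, 3
mk = lambda: rng.normal(size=(H, H + D))
h = gru_step(rng.normal(size=D), np.zeros(H), mk(), mk(), mk(),
             np.zeros(H), np.zeros(H), np.zeros(H))
print(h.shape)  # (4,)
```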

(c) Bidirectional RNN

Making good predictions can sometimes depend on information from both past and future time steps. To capture dependencies in both temporal directions, bidirectional RNNs (BiRNNs) process the sequence in both the forward and the backward direction, as sketched below. BiRNNs have proven successful in tasks such as named entity recognition and speech recognition.
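A simple way to realise this, assuming two independent tanh RNNs (one per direction), is to run one pass forward, one pass backward, and concatenate the two hidden states at each time step.

```python
import numpy as np

def rnn_states(xs, W_xh, W_hh, b_h):
    """Return the hidden state at every time step of a simple tanh RNN."""
    h, states = np.zeros(W_hh.shape[0]), []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        states.append(h)
    return np.stack(states)

def birnn_states(xs, fwd_params, bwd_params):
    """Concatenate forward-pass and backward-pass hidden states per time step."""
    h_fwd = rnn_states(xs, *fwd_params)
    h_bwd = rnn_states(xs[::-1], *bwd_params)[::-1]  # reverse back to original order
    return np.concatenate([h_fwd, h_bwd], axis=1)

# Toy usage: sequence length 5, input size 3, hidden size 4 per direction.
rng = np.random.default_rng(0)
T, D, H = 5, 3, 4
make = lambda: (rng.normal(size=(H, D)), rng.normal(size=(H, H)), np.zeros(H))
states = birnn_states(rng.normal(size=(T, D)), make(), make())
print(states.shape)  # (5, 8): each step sees past (forward) and future (backward) context
```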

(d) Attention Mechanism

Adding an attention mechanism to an RNN enables the network to concentrate on the relevant parts of the input sequence. Rather than relying exclusively on the RNN's final hidden state, attention mechanisms construct a weighted sum of all hidden states, assigning more weight to the important portions of the input sequence. This allows the model to focus flexibly on particular features and improves its performance in tasks such as machine translation.
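The following sketch shows the core computation: a score between a query (for example, a decoder state) and every encoder hidden state, a softmax over the scores, and a weighted sum. Dot-product scoring is an assumption here; attention variants differ in how the scores are computed.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend(query, hidden_states):
    """Dot-product attention: weight every hidden state by its similarity
    to the query, then return the weighted sum (the context vector)."""
    scores = hidden_states @ query          # one score per time step
    weights = softmax(scores)               # attention weights sum to 1
    context = weights @ hidden_states       # weighted sum of all hidden states
    return context, weights

# Toy usage: 5 encoder states of size 4, attended by a query of size 4.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))
context, weights = attend(rng.normal(size=4), H)
print(weights.round(2), context.shape)  # weights over 5 steps, context of shape (4,)
```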

(e) Transformer-based Models

RNNs and their variants have achieved widespread popularity; however, they have drawbacks when it comes to parallel computation and capturing long-range dependencies. Transformer-based models, first presented by Vaswani et al. in 2017, have become a strong alternative. In contrast to RNNs, Transformers use self-attention mechanisms to process the whole input sequence at once.

Self-attention is the primary mathematical component of Transformers. It enables the model to weigh different positions of the input sequence when making predictions. Because the attention weights are computed by comparing each position in the sequence with every other position, the model effectively captures both local and global dependencies.
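A minimal sketch of scaled dot-product self-attention looks like this; the single head and the absence of masking and multi-head projections are simplifying assumptions of the example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (T, d_model).
    Every position attends to every other position in a single matrix product."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (T, T): position-to-position comparisons
    weights = softmax(scores, axis=-1)  # each row is a distribution over positions
    return weights @ V                  # weighted mixture of value vectors

# Toy usage: sequence length 6, model width 8, head width 4.
rng = np.random.default_rng(0)
T, d_model, d_head = 6, 8, 4
X = rng.normal(size=(T, d_model))
out = self_attention(X, *(rng.normal(size=(d_model, d_head)) for _ in range(3)))
print(out.shape)  # (6, 4)
```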

Transformers have shown impressive results in natural language processing tasks such as machine translation and text generation. They form the foundation of large-scale pre-trained models like BERT, GPT, and T5, which have considerably advanced the state of the art across a wide variety of NLP applications.

Despite their growing popularity, Transformers should not be considered a direct replacement for RNNs. RNNs continue to perform well in applications such as speech recognition and time series analysis, where temporal structure and sequential processing are essential. Whether to choose RNNs, Transformers, or a combination of the two depends on the nature of the problem.

Conclusion

Recurrent Neural Networks (RNNs) and their variants have revolutionised the field of sequential data modelling. Thanks to their ability to capture temporal dependencies and handle variable-length inputs, RNNs have produced state-of-the-art results in a variety of fields. In this blog, we looked at the mathematics of RNNs, including their fundamental formulation and the Backpropagation Through Time (BPTT) algorithm. We also covered several popular RNN variants, including the LSTM, the GRU, bidirectional RNNs, and attention mechanisms. These variants help RNNs succeed in a wide range of applications by addressing issues such as vanishing gradients, long-term dependencies, and capturing bidirectional context. As deep learning advances, RNNs and their derivatives will likely continue to play a leading role in sequential data modelling, enabling innovations in speech recognition, natural language processing, and other fields.
