Machine Translation(MT): An important sub-field of NLP that aims to translate natural language sentences using computers
-From hand-crafted rules and linguistic knowledge to data-driven approaches that learn translation patterns from data (statistical machine translation, SMT), diverse approaches to machine translation have existed
Neural Machine Translation(NMT): A state-of-the-art machine translation approach that uses neural networks to predict the likelihood of a sequence of words
-Translation Quality:
Due to NMT’s ability to consider the entire context of a sentence, as opposed to translating piece by piece, it tends to produce translations that are more fluent and accurate—the translations sound more like a native speaker and typically are closer to the intended meaning of the original text
-Memory Efficiency:
NMT uses neural networks, which, despite their complexity, can be more memory-efficient than the large statistical models of SMT
This is because NMT learns a dense representation of language rather than storing and retrieving vast tables of phrases and translations
-End-to-End Training:
NMT systems are trained end-to-end, which means that all parts of the model are trained simultaneously to optimize translation performance. In contrast, SMT systems involve several distinct models (such as language models, alignment models, and translation models) that are trained separately and then brought together in a pipeline
-Model Simplicity:
Traditional SMT systems are composed of many different sub-components, each requiring separate tuning and optimization (like translation rules, reordering models, language models, etc.)
This can make the system quite complex and cumbersome to manage. NMT simplifies this by using a single, large neural network that learns to perform the translation task from start to finish without needing to explicitly program all the different sub-tasks involved in translation
-Contextual Understanding:
NMT models, especially those using attention mechanisms or transformer architectures, are better at capturing long-range dependencies within a sentence
This means they can better understand how words relate to each other in a sentence, leading to translations that consider the entire input sequence as a whole rather than in isolated parts
Ref:
https://www.sciencedirect.com/science/article/pii/S2666651020300024
https://omniscien.com/faq/what-is-neural-machine-translation/
Encoder-Decoder Architecture
-Example: The cat ate the mouse -> Le chat a mangé la souris
-Seq-to-Seq architecture
Consumes sequences and spits out sequences
-Two Stages
Encoder Stage:
Produces a vector, or representation, of the input sequence
Where the system takes in all the information it needs and packs it into a neat package. It’s like creating a summary or a blueprint of the information
Decoder stage:
Creates sequence
System takes that neat package and starts to build something with it, like a sequence of steps or instructions. It’s like using a blueprint to construct a model or telling a story based on a summary
Encoder that summarizes everything into a compact form, and then the Decoder that uses that summary to create a new sequence or output
Ref: https://www.youtube.com/watch?v=zbdong_h-x4
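A minimal sketch of the two-stage flow above, assuming a toy GRU-based model in PyTorch (hypothetical sizes and random data, not any paper's exact architecture):
```python
# Encoder packs the source sentence into one summary vector;
# decoder unrolls the target sequence from that summary.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src_ids):                  # (batch, src_len)
        _, final_state = self.rnn(self.embed(src_ids))
        return final_state                       # the "neat package" / summary

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt_ids, enc_state):       # teacher forcing for simplicity
        states, _ = self.rnn(self.embed(tgt_ids), enc_state)
        return self.out(states)                  # logits over the target vocabulary

# Usage: encode a 5-token source, decode a 6-token target (toy ids).
enc, dec = Encoder(vocab_size=1000, hidden_size=64), Decoder(1000, 64)
src = torch.randint(0, 1000, (1, 5))
tgt = torch.randint(0, 1000, (1, 6))
logits = dec(tgt, enc(src))                      # shape (1, 6, 1000)
```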
Attention Mechanism
-Traditional RNN Encoder Decoder
(1)Model takes one word at a time as input, updates the hidden state, and passes it to the next time step
(2)Final hidden state is passed to the Decoder
(3)Decoder works with the final hidden state for processing and translates this to the target language
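Steps (1)-(3) can be illustrated with a toy vanilla RNN encoder in NumPy (the weights and word embeddings below are made-up assumptions, only to show the hidden-state update loop):
```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hid_dim = 4, 8
W_xh = rng.normal(size=(hid_dim, emb_dim)) * 0.1   # input-to-hidden weights
W_hh = rng.normal(size=(hid_dim, hid_dim)) * 0.1   # hidden-to-hidden weights

source_words = [rng.normal(size=emb_dim) for _ in range(5)]  # toy embeddings

h = np.zeros(hid_dim)                    # start with an empty hidden state
for x in source_words:                   # (1) read one word per time step
    h = np.tanh(W_xh @ x + W_hh @ h)     #     and update the hidden state

final_hidden_state = h                   # (2) only this vector is passed on
decoder_initial_state = final_hidden_state  # (3) the decoder works from it
```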
-Example:
Translate English sentence to French sentence
Encoder-Decoder structure, popular for translating sentences is used
Problem: Words in the source language do not align with the words in the target language
In the sentence "Black cat ate the mouse", the first English word is "Black", while the first corresponding word in the French translation is "chat", meaning cat, so the word positions do not line up
-Solution: Attention Mechanism to the Encoder-Decoder structure
Allows the neural network to focus on specific parts of an input sequence
Done by assigning weights to different parts of the input sequence
The most important parts receive the highest weights
Problem:
The attention mechanism itself became widely popular, allowing the model to focus on the most relevant parts of the input text while generating the translation
However, the specific ways in which attention mechanisms can be integrated into the architecture of NMT systems are diverse and not yet fully explored
There’s potential for innovative research to develop new models that incorporate attention in different ways, which could further improve translation performance
Solution(What is written in the paper)
Two types of attention mechanisms that can be used in Neural Machine Translation (NMT) systems
(1)Global Attention
Considers the entire input sentence at once when translating any part of it
No matter which word the system is currently translating, it has the whole original sentence to draw on for context
This is like having an overview of the entire landscape when taking a photograph; you can see everything from the start
(2)Local Attention
Focuses on just a part of the input sentence at a time
When the system translates a specific word, it only looks at a few nearby source words for guidance
This is akin to taking a close-up photo of a subject, where you only focus on a small area and blur out the rest
=> Both attention methods work well
Proof:
Researchers tested their translation methods from English to German and from German to English and the results show high performance
Performance:
Local attention: Achieved a significant gain of 5.0 BLEU points over non-attentional systems that already incorporate known techniques such as dropout
Ensemble model:
Use different attention architectures
Yields a new state-of-the-art result in the WMT’15 English to German translation task with 25.9 BLEU points
An improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker
Footnote:
BLEU
Single numeric score that tells us how good a generated translation is compared to a reference translation
Geometric mean of all four n-gram precisions
-Unigram Precision = Num word matches / Num words in generation
-Clipping is used to cap the count of each word at the number of times it appears in the reference translation, to avoid rewarding the over-generation of plausible words
-To solve the word ordering problems, BLEU computes precision for several different n-grams and averages the result
Ref: https://www.youtube.com/watch?v=M05L1DhFqcw
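A simplified sketch of the clipped n-gram precision idea from the footnote above (the helper functions are hypothetical, and the brevity penalty used in full BLEU is omitted):
```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip each candidate n-gram count by its count in the reference,
    # so repeating a plausible word cannot inflate the score.
    matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return matches / max(sum(cand_counts.values()), 1)

candidate = "the the the cat".split()
reference = "the cat ate the mouse".split()
p1 = clipped_precision(candidate, reference, 1)   # 3/4, not 4/4, thanks to clipping
precisions = [clipped_precision(candidate, reference, n) for n in (1, 2, 3, 4)]
# Geometric mean of the four n-gram precisions (any zero precision gives 0 here).
bleu_core = math.exp(sum(math.log(p) for p in precisions) / 4) if all(precisions) else 0.0
print(p1, bleu_core)
```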
**What makes up NMT**
Neural Machine Translation System: A neural network that directly models the conditional probability $p(y|x)$ of translating a source sentence $x_1, \ldots, x_n$ into a target sentence $y_1, \ldots, y_m$
Two components of the basic form of NMT
(a) Encoder: Computes a representation s for each source sentence
-The encoder processes the source sentence and generates a context-rich vector representation
-This representation is meant to capture the entire meaning of the source sentence, containing all the necessary information to generate the target sentence
(b) Decoder: Generates one target word at a time and hence decomposes the conditional probability
-The decoder takes the sentence representation s and starts generating the target sentence one word at a time
-The probability of the entire target sentence $p(y|x)$ is the product of the probabilities of each word given all the previous words and the source sentence
-Each $p(y_j \mid y_{<j}, s)$ represents the probability of the j-th word given the sentence representation $s$ and all the previous words
-The decomposition makes the computation tractable since you can now generate each word sequentially and multiply the probabilities (or sum the logarithms of the probabilities for numerical stability)
Mathematically, for a target sentence $y$ with words $y_1, y_2, \ldots, y_m$, the joint probability can be expressed as $\log p(y|x) = \sum_{j=1}^{m} \log p(y_j \mid y_{<j}, s)$
-Using the logarithm in the probability equation has several advantages in computational terms
Numerical Stability:
Probabilities can be very small numbers, and multiplying many such small probabilities (as one would do to compute the joint probability of a sequence of words) can lead to numerical underflow, where the numbers become too small for the computer to represent accurately
Additive Property: Logarithms turn products into sums, which is computationally more convenient. In machine learning algorithms, especially those that involve optimization, it’s easier to work with sums than with products because sums tend to have nicer analytical properties. Gradients of sums, for instance, are easier to compute than gradients of products
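A small numeric illustration of both points, assuming made-up per-word probabilities: the raw product underflows to zero, while the sum of logarithms stays finite and usable:
```python
import math

word_probs = [1e-4] * 200          # toy per-word probabilities p(y_j | y_<j, s)

product = 1.0
for p in word_probs:
    product *= p                   # underflows to exactly 0.0 in float64

log_prob = sum(math.log(p) for p in word_probs)   # about -1842.07, no underflow

print(product)    # 0.0
print(log_prob)   # finite, and sums are easier to optimize than products
```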
-In most recent NMT work, the natural choice for modeling such a decomposition in the decoder is a recurrent neural network (RNN) architecture
They differ, however, in which RNN architectures are used for the decoder and in how the encoder computes the source sentence representation s
Decoder’s operation in an NMT system: training minimizes the negative log-likelihood $-\log p(y|x)$ over the training sentence pairs
Two types of attention mechanisms are used in neural machine translation (NMT) models: global and local attention
Both types are used to enhance the translation process by focusing on different parts of the input sentence when translating each word in the output sentence
-Global Attention:
All Source Positions: In global attention models, when generating each target word, the model considers the entire input sentence
It pays “attention” to all of the words in the source sentence, but not equally
It assigns different weights to each source word to determine their relevance to the word currently being predicted in the target language
-Local Attention:
A Few Source Positions: Local attention models, on the other hand, focus on only a subset of the source positions at a time
This means when translating a particular word, the model only looks at a small window of words around a particular point in the source sentence that it deems most relevant for the current target word
-Common Decoding Steps:
Both types of attention mechanisms operate in the context of a sequence-to-sequence model with an encoder-decoder architecture, often using stacked LSTMs
At each time step t during the decoding phase:
(1)Input Hidden State(ht)
Both models first take the hidden state ht from the top layer of a stacked LSTM decoder
(2)Context Vector(ct)
They then use this hidden state to derive a context vector ct, which is a dynamic representation of the input sentence capturing the information relevant to predicting the current target word
(3)Prediction of Current Target Word(yt)
Despite their differences in deriving ct, both global and local models use it in a similar manner afterward to help predict the current target word yt
This typically involves combining ct with ht and other relevant information in a feedforward neural network to produce the probability distribution over the possible target words
The attention mechanism allows the model to dynamically focus on different parts of the input sentence, which is particularly useful for dealing with long input sequences and aligning parts of the input with the relevant parts of the output
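A hedged NumPy sketch of this shared decoding step, assuming toy dimensions and a dot-product score: both variants turn ht into a context vector ct, differing only in which source hidden states they are allowed to attend to:
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
src_states = rng.normal(size=(7, 16))   # encoder hidden states for a 7-word source
h_t = rng.normal(size=16)               # decoder hidden state ht at step t

# Global attention: score every source position and weight them all.
scores = src_states @ h_t               # one dot-product score per source word
global_ct = softmax(scores) @ src_states

# Local attention: only a window of positions around an assumed point p_t.
p_t, D = 3, 1                           # window centre and half-width (assumptions)
window = slice(max(0, p_t - D), min(len(src_states), p_t + D + 1))
local_ct = softmax(scores[window]) @ src_states[window]

# Either context vector ct is then combined with h_t to predict the target word y_t.
```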
-How to compute the attentional hidden state(h̃t) in a neural machine translation system using an attention mechanism: h̃t = tanh(Wc[ct; ht])
h̃t:
The attentional hidden state at time step t
It is a new representation that combines information from both the target hidden state and the source-side context vector
It serves as a refined summary that the model will use to predict the next word in the target sequence
tanh:
The hyperbolic tangent function, a type of activation function that squashes the input values to be within the range of -1 and 1
It helps to introduce non-linearity into the model, which is necessary for learning complex patterns
Wc:
A weight matrix that is learned during training
It is used to transform the concatenated vectors into the attentional hidden state
The dimensions of Wc would be set such that the multiplication with the concatenated vector results in a vector of the desired size for h̃t
[ct; ht]:
The concatenation of the context vector(ct) and the target hidden state(ht)
The context vector(ct) contains information about which parts of the input sentence are most relevant at this particular time step
while ht contains information processed by the decoder up to the current time step
By concatenating them, the model brings together all the relevant information needed to focus on the correct parts of the input and make an accurate prediction for the next word
Equation (5) is showing how the model combines the current state of the decoder with the focused information from the input sentence (as provided by the attention mechanism) to form a vector that has all the information needed to predict the next word in the target sequence
This attentional hidden state becomes an integral part of the model’s decision-making for each subsequent word it generates in the translation process
Attentional vector h̃t is then fed through the softmax layer to produce the predictive distribution
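A sketch of equation (5) and the prediction step described above, with assumed toy dimensions (Wc and Ws would be learned during training in a real system):
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
hidden, vocab = 16, 50
c_t = rng.normal(size=hidden)                       # source-side context vector
h_t = rng.normal(size=hidden)                       # top-layer decoder hidden state
W_c = rng.normal(size=(hidden, 2 * hidden)) * 0.1   # projection for [ct; ht]
W_s = rng.normal(size=(vocab, hidden)) * 0.1        # output projection to the vocab

h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))  # attentional hidden state (eq. 5)
p_next_word = softmax(W_s @ h_tilde)                 # predictive distribution over words
print(p_next_word.argmax(), p_next_word.sum())       # predicted word id; sums to 1
```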
Global attention considers all hidden states of the encoder when deriving the context vector(ct)
-Alignment Vector(at(s))
For each word in the target sequence at time step t, the model computes an alignment vector which is a distribution over the source positions
The size of this vector equals the number of time steps in the source sentence
This vector essentially ‘aligns’ each target word with all of the source words, assigning a weight to each source word representing its importance in generating the current target word
ht: the current target hidden state
h̅s: A particular hidden state from the source sentence
The function align is defined as a softmax over the scores: at(s) = exp(score(ht, h̅s)) / Σs′ exp(score(ht, h̅s′))
Scoring Function: The scoring function calculates a score that measures how well the inputs at positions t in the target and s in the source align with each other
There are different ways to define this scoring function; it could be a dot product or a neural network, for instance
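A hedged sketch of the alignment vector computation, expanding on the scoring step (the dot-product score and the bilinear "general" score below are choices made for illustration):
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def align(h_t, src_states, score="dot", W_a=None):
    """Return a_t(s): one attention weight per source position."""
    if score == "dot":
        scores = src_states @ h_t          # score(ht, hbar_s) = hbar_s . ht
    else:                                  # "general": hbar_s . (W_a ht)
        scores = src_states @ (W_a @ h_t)
    return softmax(scores)                 # softmax over all source positions

rng = np.random.default_rng(0)
src_states = rng.normal(size=(6, 16))      # hbar_s for a 6-word source sentence
h_t = rng.normal(size=16)                  # current target hidden state
W_a = rng.normal(size=(16, 16)) * 0.1      # learned in practice

a_t = align(h_t, src_states, score="dot")
print(a_t.round(3), a_t.sum())             # weights over the source; they sum to 1
```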
What this Paper is about
Propose two simple and effective attentional mechanisms for neural machine translation
(1)Global approach: Looks at all source positions
(2)Local approach: One that only attends to a subset of source positions at a time
Proof
Experiment: Effectiveness of our models in the WMT translation tasks between English and German in both directions
Local attention: Yields large gains of up to 5.0 BLEU over non-attentional systems
Ensemble Model:
English to German translation direction, our model has established new state-of-the-art results for both WMT’14 and WMT’15, outperforming existing best systems
Surpassed the then-best system, which itself used Neural Machine Translation (NMT) enhanced with an n-gram reranker, by more than 1.0 BLEU
Conclusion of the paper
Compared various alignment functions and provided insights on which functions are best for which attentional models
Attention-based NMT models are superior to non-attentional ones in many cases
Example: Translating names and handling long sentences
Footnote
alignment function (also known as a score function or compatibility function)
Calculates how much focus should be placed on each part of the input data when predicting a part of the output