Machine Translation(MT): An important sub-field of NLP that aims to translate natural language sentences using computers
-From hand-crafted rules and linguistic knowledge to data-driven approaches that learn translation patterns from data (statistical machine translation, SMT), diverse approaches to machine translation have existed
Neural Machine Translation(NMT): A state-of-the-art machine translation approach that uses neural networks to predict the likelihood of a sequence of words
-Translation Quality:
Due to NMT’s ability to consider the entire context of a sentence, as opposed to translating piece by piece, it tends to produce translations that are more fluent and accurate—the translations sound more like a native speaker and typically are closer to the intended meaning of the original text
-Memory Efficiency:
NMT uses neural networks, which, despite their complexity, can be more memory-efficient than the large statistical models of SMT
This is because NMT learns a dense representation of language rather than storing and retrieving vast tables of phrases and translations
-End-to-End Training:
NMT systems are trained end-to-end, which means that all parts of the model are trained simultaneously to optimize translation performance. In contrast, SMT systems involve several distinct models (such as language models, alignment models, and translation models) that are trained separately and then brought together in a pipeline
-Model Simplicity:
Traditional SMT systems are composed of many different sub-components, each requiring separate tuning and optimization (like translation rules, reordering models, language models, etc.)
This can make the system quite complex and cumbersome to manage. NMT simplifies this by using a single, large neural network that learns to perform the translation task from start to finish without needing to explicitly program all the different sub-tasks involved in translation
-Contextual Understanding:
NMT models, especially those using attention mechanisms or transformer architectures, are better at capturing long-range dependencies within a sentence
This means they can better understand how words relate to each other in a sentence, leading to translations that consider the entire input sequence as a whole rather than in isolated parts
Ref:
https://www.sciencedirect.com/science/article/pii/S2666651020300024
https://omniscien.com/faq/what-is-neural-machine-translation/
Encoder-Decoder Architecture
-Example: The cat ate the mouse -> Le chat a mangé la souris
-Seq-to-Seq architecture
Consumes sequences and spits out sequences
-Two Stages
Encoder Stage:
Produces a vector, or representation, of the input sequence
Where the system takes in all the information it needs and packs it into a neat package. It’s like creating a summary or a blueprint of the information
Decoder stage:
Creates sequence
System takes that neat package and starts to build something with it, like a sequence of steps or instructions. It’s like using a blueprint to construct a model or telling a story based on a summary
Encoder that summarizes everything into a compact form, and then the Decoder that uses that summary to create a new sequence or output
Ref: https://www.youtube.com/watch?v=zbdong_h-x4
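A minimal sketch of the two-stage flow above, assuming a toy GRU-based model in PyTorch (hypothetical sizes and random data, not any paper's exact architecture):
```python
# Encoder packs the source sentence into one summary vector;
# decoder unrolls the target sequence from that summary.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src_ids):                  # (batch, src_len)
        _, final_state = self.rnn(self.embed(src_ids))
        return final_state                       # the "neat package" / summary

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt_ids, enc_state):       # teacher forcing for simplicity
        states, _ = self.rnn(self.embed(tgt_ids), enc_state)
        return self.out(states)                  # logits over the target vocabulary

# Usage: encode a 5-token source, decode a 6-token target (toy ids).
enc, dec = Encoder(vocab_size=1000, hidden_size=64), Decoder(1000, 64)
src = torch.randint(0, 1000, (1, 5))
tgt = torch.randint(0, 1000, (1, 6))
logits = dec(tgt, enc(src))                      # shape (1, 6, 1000)
```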
Attention Mechanism
-Traditional RNN Encoder Decoder
(1)Model takes one word at a time as input, updates the hidden state, and passes it to the next time step
(2)Final hidden state is passed to the Decoder
(3)Decoder works with the final hidden state for processing and translates this to the target language
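Steps (1)-(3) can be illustrated with a toy vanilla RNN encoder in NumPy (the weights and word embeddings below are made-up assumptions, only to show the hidden-state update loop):
```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hid_dim = 4, 8
W_xh = rng.normal(size=(hid_dim, emb_dim)) * 0.1   # input-to-hidden weights
W_hh = rng.normal(size=(hid_dim, hid_dim)) * 0.1   # hidden-to-hidden weights

source_words = [rng.normal(size=emb_dim) for _ in range(5)]  # toy embeddings

h = np.zeros(hid_dim)                    # start with an empty hidden state
for x in source_words:                   # (1) read one word per time step
    h = np.tanh(W_xh @ x + W_hh @ h)     #     and update the hidden state

final_hidden_state = h                   # (2) only this vector is passed on
decoder_initial_state = final_hidden_state  # (3) the decoder works from it
```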
-Example:
Translate English sentence to French sentence
Encoder-Decoder structure, popular for translating sentences is used
Problem: Words in the source language do not align with the words in the target language
In the sentence "Black cat ate the mouse", the first English word is "Black", while the first corresponding word in the French translation is "chat", meaning cat, so the word positions do not line up
-Solution: Attention Mechanism to the Encoder-Decoder structure
Allows the neural network to focus on specific parts of an input sequence
Done by assigning weights to different parts of the input sequence
The most important parts receive the highest weights
Problem:
The attention mechanism itself became widely popular, allowing the model to focus on the most relevant parts of the input text while generating the translation
However, the specific ways in which attention mechanisms can be integrated into the architecture of NMT systems are diverse and not yet fully explored
There’s potential for innovative research to develop new models that incorporate attention in different ways, which could further improve translation performance
Solution(What is written in the paper)
Two types of attention mechanisms that can be used in Neural Machine Translation (NMT) systems
(1)Global Attention
Considers the entire input sentence at once when translating any part of it
No matter which word the system is currently translating, it has the whole original sentence to draw on for context
This is like having an overview of the entire landscape when taking a photograph; you can see everything from the start
(2)Local Attention
Focuses on just a part of the input sentence at a time
When the system translates a specific word, it only looks at a few nearby source words for guidance
This is akin to taking a close-up photo of a subject, where you only focus on a small area and blur out the rest
=> Both attention methods work well
Proof:
Researchers tested their translation methods from English to German and from German to English and the results show high performance
Performance:
Local attention: Achieved a significant gain of 5.0 BLEU points over non-attentional systems that already incorporate known techniques such as dropout
Ensemble model:
Use different attention architectures
Yields a new state-of-the-art result in the WMT’15 English to German translation task with 25.9 BLEU points
An improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker
Footnote:
BLEU
Single numeric score that tells us how good a generated translation is compared to a reference translation
Geometric mean of all four n-gram precisions
-Unigram Precision = Num word matches / Num words in generation
-Clipping is used to cap the count of each word at the number of times it appears in the reference translation, to avoid rewarding the over-generation of plausible words
-To solve the word ordering problems, BLEU computes precision for several different n-grams and averages the result
Ref: https://www.youtube.com/watch?v=M05L1DhFqcw
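A simplified sketch of the clipped n-gram precision idea from the footnote above (the helper functions are hypothetical, and the brevity penalty used in full BLEU is omitted):
```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip each candidate n-gram count by its count in the reference,
    # so repeating a plausible word cannot inflate the score.
    matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return matches / max(sum(cand_counts.values()), 1)

candidate = "the the the cat".split()
reference = "the cat ate the mouse".split()
p1 = clipped_precision(candidate, reference, 1)   # 3/4, not 4/4, thanks to clipping
precisions = [clipped_precision(candidate, reference, n) for n in (1, 2, 3, 4)]
# Geometric mean of the four n-gram precisions (any zero precision gives 0 here).
bleu_core = math.exp(sum(math.log(p) for p in precisions) / 4) if all(precisions) else 0.0
print(p1, bleu_core)
```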
**What makes up NMT**
Neural Machine Translation System: A neural network that directly models the conditional probability $p(y|x)$ of translating a source sentence $x_1, \ldots, x_n$ into a target sentence $y_1, \ldots, y_m$
Two components of the basic form of NMT
(a) Encoder: Computes a representation s for each source sentence
-The encoder processes the source sentence and generates a context-rich vector representation
-This representation is meant to capture the entire meaning of the source sentence, containing all the necessary information to generate the target sentence
(b) Decoder: Generates one target word at a time and hence decomposes the conditional probability
-The decoder takes the sentence representation s and starts generating the target sentence one word at a time
-The probability of the entire target sentence $p(y|x)$ is the product of the probabilities of each word given all the previous words and the source sentence
-Each $p(y_j \mid y_{<j}, s)$ represents the probability of the j-th word given the sentence representation $s$ and all the previous words
-The decomposition makes the computation tractable since you can now generate each word sequentially and multiply the probabilities (or sum the logarithms of the probabilities for numerical stability)
Mathematically, for a target sentence $y$ with words $y_1, y_2, \ldots, y_m$, the joint probability can be expressed as $\log p(y|x) = \sum_{j=1}^{m} \log p(y_j \mid y_{<j}, s)$
-Using the logarithm in the probability equation has several advantages in computational terms
Numerical Stability:
Probabilities can be very small numbers, and multiplying many such small probabilities (as one would do to compute the joint probability of a sequence of words) can lead to numerical underflow, where the numbers become too small for the computer to represent accurately
Additive Property: Logarithms turn products into sums, which is computationally more convenient. In machine learning algorithms, especially those that involve optimization, it’s easier to work with sums than with products because sums tend to have nicer analytical properties. Gradients of sums, for instance, are easier to compute than gradients of products
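A small numeric illustration of both points, assuming made-up per-word probabilities: the raw product underflows to zero, while the sum of logarithms stays finite and usable:
```python
import math

word_probs = [1e-4] * 200          # toy per-word probabilities p(y_j | y_<j, s)

product = 1.0
for p in word_probs:
    product *= p                   # underflows to exactly 0.0 in float64

log_prob = sum(math.log(p) for p in word_probs)   # about -1842.07, no underflow

print(product)    # 0.0
print(log_prob)   # finite, and sums are easier to optimize than products
```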
-In most recent NMT work, the natural choice for modeling such a decomposition in the decoder is a recurrent neural network (RNN) architecture
They differ, however, in which RNN architectures are used for the decoder and in how the encoder computes the source sentence representation s
Decoder’s operation in an NMT system: training minimizes the negative log-likelihood $-\log p(y|x)$ over the training sentence pairs
Two types of attention mechanisms are used in neural machine translation (NMT) models: global and local attention
Both types are used to enhance the translation process by focusing on different parts of the input sentence when translating each word in the output sentence
-Global Attention:
All Source Positions: In global attention models, when generating each target word, the model considers the entire input sentence
It pays “attention” to all of the words in the source sentence, but not equally
It assigns different weights to each source word to determine their relevance to the word currently being predicted in the target language
-Local Attention:
A Few Source Positions: Local attention models, on the other hand, focus on only a subset of the source positions at a time
This means when translating a particular word, the model only looks at a small window of words around a particular point in the source sentence that it deems most relevant for the current target word
-Common Decoding Steps:
Both types of attention mechanisms operate in the context of a sequence-to-sequence model with an encoder-decoder architecture, often using stacked LSTMs
At each time step t during the decoding phase:
(1)Input Hidden State(ht)
Both models first take the hidden state ht from the top layer of a stacked LSTM decoder
(2)Context Vector(ct)
They then use this hidden state to derive a context vector ct, which is a dynamic representation of the input sentence capturing the information relevant to predicting the current target word
(3)Prediction of Current Target Word(yt)
Despite their differences in deriving ct, both global and local models use it in a similar manner afterward to help predict the current target word yt
This typically involves combining ct with ht and other relevant information in a feedforward neural network to produce the probability distribution over the possible target words
The attention mechanism allows the model to dynamically focus on different parts of the input sentence, which is particularly useful for dealing with long input sequences and aligning parts of the input with the relevant parts of the output
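A hedged NumPy sketch of this shared decoding step, assuming toy dimensions and a dot-product score: both variants turn ht into a context vector ct, differing only in which source hidden states they are allowed to attend to:
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
src_states = rng.normal(size=(7, 16))   # encoder hidden states for a 7-word source
h_t = rng.normal(size=16)               # decoder hidden state ht at step t

# Global attention: score every source position and weight them all.
scores = src_states @ h_t               # one dot-product score per source word
global_ct = softmax(scores) @ src_states

# Local attention: only a window of positions around an assumed point p_t.
p_t, D = 3, 1                           # window centre and half-width (assumptions)
window = slice(max(0, p_t - D), min(len(src_states), p_t + D + 1))
local_ct = softmax(scores[window]) @ src_states[window]

# Either context vector ct is then combined with h_t to predict the target word y_t.
```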
-How to compute the attentional hidden state(h̃t) in a neural machine translation system using an attention mechanism: h̃t = tanh(Wc[ct; ht])
h̃t:
The attentional hidden state at time step t
It is a new representation that combines information from both the target hidden state and the source-side context vector
It serves as a refined summary that the model will use to predict the next word in the target sequence
tanh:
The hyperbolic tangent function, a type of activation function that squashes the input values to be within the range of -1 and 1
It helps to introduce non-linearity into the model, which is necessary for learning complex patterns
Wc:
A weight matrix that is learned during training
It is used to transform the concatenated vectors into the attentional hidden state
The dimensions of Wc would be set such that the multiplication with the concatenated vector results in a vector of the desired size for h̃t
[ct; ht]:
The concatenation of the context vector(ct) and the target hidden state(ht)
The context vector(ct) contains information about which parts of the input sentence are most relevant at this particular time step
while ht contains information processed by the decoder up to the current time step
By concatenating them, the model brings together all the relevant information needed to focus on the correct parts of the input and make an accurate prediction for the next word
Equation (5) is showing how the model combines the current state of the decoder with the focused information from the input sentence (as provided by the attention mechanism) to form a vector that has all the information needed to predict the next word in the target sequence
This attentional hidden state becomes an integral part of the model’s decision-making for each subsequent word it generates in the translation process
Attentional vector h̃t is then fed through the softmax layer to produce the predictive distribution
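A sketch of equation (5) and the prediction step described above, with assumed toy dimensions (Wc and Ws would be learned during training in a real system):
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
hidden, vocab = 16, 50
c_t = rng.normal(size=hidden)                       # source-side context vector
h_t = rng.normal(size=hidden)                       # top-layer decoder hidden state
W_c = rng.normal(size=(hidden, 2 * hidden)) * 0.1   # projection for [ct; ht]
W_s = rng.normal(size=(vocab, hidden)) * 0.1        # output projection to the vocab

h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))  # attentional hidden state (eq. 5)
p_next_word = softmax(W_s @ h_tilde)                 # predictive distribution over words
print(p_next_word.argmax(), p_next_word.sum())       # predicted word id; sums to 1
```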
Global attention considers all hidden states of the encoder when deriving the context vector(ct)
-Alignment Vector(at(s))
For each word in the target sequence at time step t, the model computes an alignment vector which is a distribution over the source positions
The size of this vector equals the number of time steps in the source sentence
This vector essentially ‘aligns’ each target word with all of the source words, assigning a weight to each source word representing its importance in generating the current target word
ht: the current target hidden state
h̅s: A particular hidden state from the source sentence
The function align is defined as a softmax over the scores: at(s) = exp(score(ht, h̅s)) / Σs′ exp(score(ht, h̅s′))
Scoring Function: The scoring function calculates a score that measures how well the inputs at positions t in the target and s in the source align with each other
There are different ways to define this scoring function; it could be a dot product or a neural network, for instance
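A hedged sketch of the alignment vector computation, expanding on the scoring step (the dot-product score and the bilinear "general" score below are choices made for illustration):
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def align(h_t, src_states, score="dot", W_a=None):
    """Return a_t(s): one attention weight per source position."""
    if score == "dot":
        scores = src_states @ h_t          # score(ht, hbar_s) = hbar_s . ht
    else:                                  # "general": hbar_s . (W_a ht)
        scores = src_states @ (W_a @ h_t)
    return softmax(scores)                 # softmax over all source positions

rng = np.random.default_rng(0)
src_states = rng.normal(size=(6, 16))      # hbar_s for a 6-word source sentence
h_t = rng.normal(size=16)                  # current target hidden state
W_a = rng.normal(size=(16, 16)) * 0.1      # learned in practice

a_t = align(h_t, src_states, score="dot")
print(a_t.round(3), a_t.sum())             # weights over the source; they sum to 1
```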
What this Paper is about
Propose two simple and effective attentional mechanisms for neural machine translation
(1)Global approach: Looks at all source positions
(2)Local approach: One that only attends to a subset of source positions at a time
Proof
Experiment: Effectiveness of our models in the WMT translation tasks between English and German in both directions
Local attention: Yields large gains of up to 5.0 BLEU over non-attentional systems
Ensemble Model:
English to German translation direction, our model has established new state-of-the-art results for both WMT’14 and WMT’15, outperforming existing best systems
Surpassed the then-best system, which itself used Neural Machine Translation (NMT) enhanced with an n-gram reranker, by more than 1.0 BLEU
Conclusion of the paper
Compared various alignment functions and provided insights on which functions are best for which attentional models
Attention-based NMT models are superior to non-attentional ones in many cases
Example: Translating names and handling long sentences
Footnote
alignment function (also known as a score function or compatibility function)
Calculates how much focus should be placed on each part of the input data when predicting a part of the output