BLEU Score for Evaluating Neural Machine Translation using Python
Using NMT (Neural Machine Translation) in NLP, we can translate a text from a source language into a target language. To evaluate how well the translation is performed, we use the BLEU (Bilingual Evaluation Understudy) score in Python.
The BLEU score works by comparing a machine-translated sentence with one or more human-translated reference sentences, broken into n-grams. It also applies a brevity penalty, so candidate translations that are much shorter than the reference score lower. In general, a BLEU score lies in the range 0 to 1, and a higher value indicates better quality; achieving a perfect score is very rare. Note that the evaluation is based purely on n-gram matching and does not consider other aspects of language such as coherence, tense, and grammar.
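To make the n-gram comparison concrete, here is a minimal sketch (plain Python, no external libraries) of how a tokenized sentence is broken into the n-grams that BLEU compares:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

candidate = "it is raining today".split()
print(ngrams(candidate, 1))  # [('it',), ('is',), ('raining',), ('today',)]
print(ngrams(candidate, 2))  # [('it', 'is'), ('is', 'raining'), ('raining', 'today')]
```

BLEU counts how many of the candidate's n-grams also appear in the reference, for each n-gram order, and combines those precisions into a single score.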
Formula
The BLEU score is calculated using the following formula:
BLEU = BP * exp(1/n * sum_{i=1}^{n} log(p_i))
Here, the terms have the following meanings:
BP is the Brevity Penalty. It penalizes candidate translations that are shorter than the reference. Its formula is given by:
BP = min(1, exp(1 - (r / c)))
n is the maximum order of n-gram matching (typically 4)
p_i is the modified (clipped) precision score for i-grams
r is the reference length and c is the candidate (translation) length
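The formula above can be sketched directly in plain Python. This is a simplified single-sentence illustration (clipped n-gram counts, no smoothing), not a replacement for a library implementation:

```python
import math
from collections import Counter

def bleu_score(candidate, references, max_n=4):
    """Compute BLEU for one tokenized candidate against tokenized references."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        # Clip each candidate n-gram count by its maximum count in any reference
        max_ref = Counter()
        for ref in references:
            ref_ngrams = Counter(tuple(ref[i:i + n])
                                 for i in range(len(ref) - n + 1))
            for g, c in ref_ngrams.items():
                max_ref[g] = max(max_ref[g], c)
        matched = sum(min(c, max_ref[g]) for g, c in cand_ngrams.items())
        total = sum(cand_ngrams.values())
        p_n = matched / total if total else 0.0
        if p_n == 0.0:
            return 0.0  # any zero precision makes the geometric mean 0
        log_precision_sum += math.log(p_n)
    # Brevity penalty: r is the closest reference length, c the candidate length
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = min(1.0, math.exp(1 - r / c))
    return bp * math.exp(log_precision_sum / max_n)

candidate = "the cat sat on mat".split()
reference = "the cat sat on the mat".split()
print(round(bleu_score(candidate, [reference]), 4))  # 0.5789
```

Real implementations (such as the datasets metric used in the examples below) differ in details like smoothing and corpus-level aggregation, but follow the same formula.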
Algorithm
Step 1: Import the datasets library.
Step 2: Use the load_metric function with "bleu" as its parameter.
Step 3: Make a list out of the words of the translated string.
Step 4: Repeat step 3 with the words of each desired output string.
Step 5: Use bleu.compute to find the BLEU value.
Example 1: Poor Translation Quality
In this example, we will use Python's datasets library to calculate the BLEU score for a German sentence machine translated to English:
Source text (German): es regnet heute
Machine translated text: it rain today
Desired text: it is raining today / it was raining today
Although we can see that the translation is not done correctly, we can get a better view of the translation quality by computing the BLEU score:
# Import the load_metric function
# (note: in newer versions of the datasets library, load_metric is
# deprecated in favour of the separate evaluate library: evaluate.load("bleu"))
from datasets import load_metric

# Load the BLEU metric
bleu = load_metric("bleu")

# The tokenized machine-translated (candidate) sentence
predictions = [["it", "rain", "today"]]

# The tokenized reference translations (a list of references per prediction)
references = [
    [["it", "is", "raining", "today"],
     ["it", "was", "raining", "today"]]
]

# Compute and print the BLEU score
result = bleu.compute(predictions=predictions, references=references)
print(result)
{'bleu': 0.0, 'precisions': [0.6666666666666666, 0.0, 0.0, 0.0], 'brevity_penalty': 0.7165313105737893, 'length_ratio': 0.75, 'translation_length': 3, 'reference_length': 4}
You can see that the translation is not very good. The BLEU score comes out to be 0 because none of the 2-grams (or higher) in the translation match the references, and a single zero n-gram precision drives the geometric mean, and hence the whole score, to 0.
Example 2: Better Translation Quality
In this example, we will again calculate the BLEU score, but this time for a French sentence that is machine translated to English:
Source text (French): nous partons en voyage
Machine translated text: we going on a trip
Desired text: we are going on a trip / we were going on a trip
You can see that this time the translated text is much closer to the desired text. Let us check its BLEU score:
# Import the load_metric function
from datasets import load_metric

# Load the BLEU metric
bleu = load_metric("bleu")

# The tokenized machine-translated (candidate) sentence
predictions = [["we", "going", "on", "a", "trip"]]

# The tokenized reference translations (a list of references per prediction)
references = [
    [["we", "are", "going", "on", "a", "trip"],
     ["we", "were", "going", "on", "a", "trip"]]
]

# Compute and print the BLEU score
result = bleu.compute(predictions=predictions, references=references)
print(result)
{'bleu': 0.5789300674674098, 'precisions': [1.0, 0.75, 0.6666666666666666, 0.5], 'brevity_penalty': 0.8187307530779819, 'length_ratio': 0.8333333333333334, 'translation_length': 5, 'reference_length': 6}
You can see that this time the translation was quite close to the desired output: every word in it appears in the references (a 1-gram precision of 1.0), and thus the BLEU score is higher than 0.5.
Understanding BLEU Score Components
The BLEU output provides several important metrics:
bleu: The final BLEU score (0 to 1)
precisions: Precision scores for 1-grams, 2-grams, 3-grams, and 4-grams
brevity_penalty: Penalty applied to translations shorter than the reference
length_ratio: Ratio of translation length to reference length
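These components are tied together by the BLEU formula itself: the final score is the brevity penalty times the geometric mean of the n-gram precisions. As a quick sanity check, we can reconstruct the score from the output dictionary of the second example above:

```python
import math

# Values taken from the output of the second example above
result = {'bleu': 0.5789300674674098,
          'precisions': [1.0, 0.75, 0.6666666666666666, 0.5],
          'brevity_penalty': 0.8187307530779819}

# BLEU = BP * exp(mean of log precisions)
geo_mean = math.exp(sum(math.log(p) for p in result['precisions']) / 4)
reconstructed = result['brevity_penalty'] * geo_mean
print(abs(reconstructed - result['bleu']) < 1e-9)  # True
```

This also explains why the first example scored exactly 0: with any precision equal to 0, the geometric mean (and therefore the whole score) collapses to 0.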
Conclusion
The BLEU score is a useful tool for evaluating translation models by comparing n-gram precision against human references. While it provides a quick, automatic assessment, BLEU has limitations: it relies on surface-level n-gram matching and ignores language nuances such as grammar and coherence.
