BLEU Score for Evaluating Neural Machine Translation using Python


Using NMT, or Neural Machine Translation, in NLP, we can translate text from a source language into a target language. To evaluate how well the translation is performed, we use the BLEU, or Bilingual Evaluation Understudy, score in Python.

The BLEU score works by comparing the n-grams of a machine-translated sentence with the n-grams of one or more human (reference) translations. Longer sentences are harder to match n-gram for n-gram, so the score tends to decrease as sentence length increases. In general, a BLEU score lies in the range 0 to 1, and a higher value indicates better quality; however, achieving a perfect score is very rare. Note that the evaluation is based purely on n-gram matching and does not consider other aspects of language such as coherence, tense, and grammar.
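As a quick illustration of what comparing in n-grams means, the snippet below (plain Python, no library needed) lists the unigrams and bigrams of a reference sentence; BLEU essentially counts how many of the candidate's n-grams also occur among these.

#list the unigrams and bigrams of a reference sentence; BLEU counts how many
#n-grams of the machine translation also appear among them
tokens = ["it", "is", "raining", "today"]
unigrams = [tuple(tokens[i:i + 1]) for i in range(len(tokens))]
bigrams = [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

print(unigrams)   # [('it',), ('is',), ('raining',), ('today',)]
print(bigrams)    # [('it', 'is'), ('is', 'raining'), ('raining', 'today')]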

Formula

BLEU = BP * exp(1/n * sum_{i=1}^{n} log(p_i))

Here, the various terms have the following meanings −

  • BP is the Brevity Penalty. It penalizes machine translations that are shorter than the reference, where r is the length of the reference text and c is the length of the candidate (machine translated) text. Its formula is given by −

BP = min(1, exp(1 - (r / c)))
  • n is the maximum order of n-gram matching

  • p_i is the modified i-gram precision, that is, the fraction of the candidate's i-grams that also appear in a reference (the short sketch below shows how these terms combine)
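Below is a minimal sketch of how these terms combine, assuming the precisions and the two lengths are already known. The helper name bleu_from_parts is our own, not part of any library; feeding it the precisions and lengths reported in Example 2 further down reproduces the score printed there.

import math

def bleu_from_parts(precisions, translation_length, reference_length):
   #combine the n-gram precisions with uniform weights 1/n, exactly as in
   #the formula above; a single zero precision collapses the geometric mean
   if min(precisions) == 0:
      return 0.0
   bp = min(1.0, math.exp(1 - reference_length / translation_length))
   log_mean = sum(math.log(p) for p in precisions) / len(precisions)
   return bp * math.exp(log_mean)

#precisions and lengths taken from Example 2 below
print(bleu_from_parts([1.0, 0.75, 2/3, 0.5], translation_length=5, reference_length=6))
# ≈ 0.5789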

Algorithm

  • Step 1 − Import the datasets library.

  • Step 2 − Use the load_metric function with bleu as its parameter.

  • Step 3 − Make a list out of the words of the translated string.

  • Step 4 − Repeat step 3 for each of the desired (reference) output strings.

  • Step 5 − Use bleu.compute to find the BLEU value.

Example 1

In this example, we will use the Hugging Face datasets library to calculate the BLEU score for a German sentence machine translated to English.

  • Source text (German) − es regnet heute

  • Machine translated text − it rain today

  • Desired (reference) texts − "it is raining today" and "it was raining today"

Although we can see that the translation is not correct, we can get a better view of the translation quality by finding the BLEU score.

Example

#import the libraries
from datasets import load_metric

#use the load_metric function to load the bleu metric
bleu = load_metric("bleu")

#set up the predicted (machine translated) string as a list of tokens
predictions = [["it", "rain", "today"]]

#set up the desired (reference) strings; each prediction can have several references
references = [
   [["it", "is", "raining", "today"],
    ["it", "was", "raining", "today"]]
]

#compute and print the BLEU results
print(bleu.compute(predictions=predictions, references=references))

Output

{'bleu': 0.0, 'precisions': [0.6666666666666666, 0.0, 0.0, 0.0], 'brevity_penalty': 0.7165313105737893, 'length_ratio': 0.75, 'translation_length': 3, 'reference_length': 4}

You can see that the translation is not very good: only the unigram precision is non-zero, and thus the BLEU score comes out to be 0.
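To see where these numbers come from, here is a small sketch (plain Python, for illustration only) that reproduces the reported brevity penalty and shows why the score collapses to 0.

import math

#the brevity penalty follows from the length ratio 3/4
translation_length, reference_length = 3, 4
print(math.exp(1 - reference_length / translation_length))   # ≈ 0.7165

#the precisions are [2/3, 0, 0, 0]; any zero precision makes the geometric mean,
#and hence the BLEU score, equal to 0 regardless of the brevity penalty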

Example 2

In this example, we will again calculate the BLEU score. But this time, we will take a French sentence that is machine translated to English.

  • Source text (French) − nous partons en voyage

  • Machine translated text − we going on a trip

  • Desired (reference) texts − "we are going on a trip" and "we were going on a trip"

You can see that this time, the translated text is much closer to the desired text. Let us check the BLEU score for it.

Example

#import the libraries
from datasets import load_metric

#use the load_metric function to load the bleu metric
bleu = load_metric("bleu")

#set up the predicted (machine translated) string as a list of tokens
predictions = [["we", "going", "on", "a", "trip"]]

#set up the desired (reference) strings; each prediction can have several references
references = [
   [["we", "are", "going", "on", "a", "trip"],
    ["we", "were", "going", "on", "a", "trip"]]
]

#compute and print the BLEU results
print(bleu.compute(predictions=predictions, references=references))

Output

{'bleu': 0.5789300674674098, 'precisions': [1.0, 0.75, 0.6666666666666666, 0.5], 'brevity_penalty': 0.8187307530779819, 'length_ratio': 0.8333333333333334, 'translation_length': 5, 'reference_length': 6}

You can see that this time, the translation was quite close to the desired output and thus, the BLEU score is also higher than 0.5.
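As a quick cross-check, the reported score can be reproduced from its parts; the sketch below (illustration only, requires Python 3.8+ for math.prod) combines the printed precisions and brevity penalty exactly as in the formula given at the beginning of this article.

import math

#BLEU = BP * (p1 * p2 * p3 * p4) ** (1/4)
bp = math.exp(1 - 6 / 5)                     # ≈ 0.8187, from 5 candidate vs. 6 reference tokens
precisions = [1.0, 0.75, 2 / 3, 0.5]
print(bp * math.prod(precisions) ** 0.25)    # ≈ 0.5789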

Conclusion

BLEU Score is a wonderful tool to check the efficiency of a translation model and thus improve it further to produce better results. Although the BLEU score can be used to get a rough idea about a model, it only rewards exact word matches and often ignores the nuances of language. This is why the BLEU score sometimes correlates only weakly with human judgment. But there are some alternatives like the ROUGE score, the METEOR metric, and the CIDEr metric that you can definitely try.

