BLEU Score for Evaluating Neural Machine Translation using Python

Neural Machine Translation (NMT) is an NLP technique for translating text from a source language to a target language. To evaluate how well the translation is performed, we use the BLEU (Bilingual Evaluation Understudy) score in Python.

The BLEU score works by comparing the n-grams of machine-translated sentences against the n-grams of human reference translations. Longer sentences tend to score lower, since higher-order n-grams become harder to match. In general, a BLEU score is in the range from 0 to 1, and a higher value indicates better quality; achieving a perfect score is very rare. Note that the evaluation is based on n-gram matching and does not consider other aspects of language such as coherence, tense, and grammar.

Formula

The BLEU score is calculated using the following formula:

BLEU = BP * exp(1/n * sum_{i=1}^{n} log(p_i))

Here, the various terms have the following meanings:

  • BP is the Brevity Penalty. It adjusts the BLEU score based on the lengths of the two texts. Its formula is given by

BP = min(1, exp(1 - (r / c)))

It equals 1 when the candidate is at least as long as the reference, and falls below 1 (penalizing the score) when the candidate is shorter.

  • n is the maximum order of n-gram matching (typically 4).

  • p_i is the modified precision score for i-grams.

  • r is the reference length and c is the candidate length.
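To make the formula concrete, here is a minimal pure-Python sketch of the computation. The helper names ngrams, modified_precision, and bleu are illustrative, not from any library, and no smoothing is applied (real implementations often smooth zero precisions):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    # Clipped counts: a candidate n-gram is credited at most as many
    # times as it appears in any single reference
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for g, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[g] = max(max_ref_counts[g], count)
    clipped = sum(min(count, max_ref_counts[g]) for g, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, references, max_n=4):
    c = len(candidate)
    # r: length of the reference closest in length to the candidate
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = min(1.0, math.exp(1 - r / c))
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # any zero precision drives the geometric mean to zero
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

cand = "we going on a trip".split()
refs = ["we are going on a trip".split(), "we were going on a trip".split()]
print(round(bleu(cand, refs), 4))  # 0.5789
```

This reproduces the behavior of library implementations on simple inputs: the brevity penalty handles the length mismatch, and the geometric mean combines the per-order precisions.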

Algorithm

  • Step 1: Import the datasets library.

  • Step 2: Load the BLEU metric with the load_metric function, passing "bleu" as its parameter.

  • Step 3: Split the machine-translated string into a list of words (tokens).

  • Step 4: Repeat Step 3 for each desired (reference) output string.

  • Step 5: Call bleu.compute to obtain the BLEU value.

Example 1: Poor Translation Quality

In this example, we will use Python's datasets library to calculate the BLEU score for a German sentence machine translated to English. (Note: in recent releases of datasets, load_metric is deprecated in favor of the separate evaluate library; the snippets below assume a version where load_metric is still available.)

  • Source text (German): es regnet heute

  • Machine-translated text: it rain today

  • Desired texts: it is raining today / it was raining today

Although we can see that the translation is not done correctly, finding the BLEU score gives us a quantitative view of the translation quality:

# Import the load_metric function from the datasets library
from datasets import load_metric

# Load the BLEU metric
bleu = load_metric("bleu")

# Tokenize the machine-translated (predicted) string
predictions = [["it", "rain", "today"]]

# Tokenize the reference (desired) strings
references = [
   [["it", "is", "raining", "today"], 
   ["it", "was", "raining", "today"]]
]

# Compute and print the BLEU score
result = bleu.compute(predictions=predictions, references=references)
print(result)
{'bleu': 0.0, 'precisions': [0.6666666666666666, 0.0, 0.0, 0.0], 'brevity_penalty': 0.7165313105737893, 'length_ratio': 0.75, 'translation_length': 3, 'reference_length': 4}

You can see that the translation is not very good: only the unigrams "it" and "today" match, no 2-gram (or higher) matches at all, and thus the BLEU score comes out to be 0.
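These numbers can be checked by hand. The brevity penalty is exp(1 - r/c) with reference length r = 4 and candidate length c = 3, and because no bigram of "it rain today" appears in either reference, the 2-gram precision is 0, which drives the geometric mean (and hence BLEU) to 0:

```python
import math

r, c = 4, 3                # reference length and candidate length
bp = math.exp(1 - r / c)   # candidate is shorter, so BP < 1
print(round(bp, 6))        # 0.716531, matching 'brevity_penalty'

# 1-gram precision: "it" and "today" match, "rain" does not -> 2/3
print(round(2 / 3, 6))     # 0.666667, matching the first precision
# 2-, 3-, and 4-gram precisions are all 0, so BLEU is 0
```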

Example 2: Better Translation Quality

In this example, we will again calculate the BLEU score. But this time, we will take a French sentence that is machine translated to English:

  • Source text (French): nous partons en voyage

  • Machine-translated text: we going on a trip

  • Desired texts: we are going on a trip / we were going on a trip

You can see that this time, the translated text is much closer to the desired text. Let us check the BLEU score for it:

# Import the load_metric function from the datasets library
from datasets import load_metric

# Load the BLEU metric
bleu = load_metric("bleu")

# Tokenize the machine-translated (predicted) string
predictions = [["we", "going", "on", "a", "trip"]]

# Tokenize the reference (desired) strings
references = [
   [["we", "are", "going", "on", "a", "trip"], 
   ["we", "were", "going", "on", "a", "trip"]]
]

# Compute and print the BLEU score
result = bleu.compute(predictions=predictions, references=references)
print(result)
{'bleu': 0.5789300674674098, 'precisions': [1.0, 0.75, 0.6666666666666666, 0.5], 'brevity_penalty': 0.8187307530779819, 'length_ratio': 0.8333333333333334, 'translation_length': 5, 'reference_length': 6}

You can see that this time, the translation is quite close to the desired output, and thus the BLEU score is also higher than 0.5.
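The reported values again follow directly from the formula: BP = exp(1 - 6/5) since r = 6 and c = 5, and multiplying it by the geometric mean of the four reported precisions reproduces the final score:

```python
import math

precisions = [1.0, 0.75, 2 / 3, 0.5]    # as reported by bleu.compute
bp = math.exp(1 - 6 / 5)                # r = 6, c = 5
geo_mean = math.exp(sum(math.log(p) for p in precisions) / 4)

print(round(bp, 6))             # 0.818731, matching 'brevity_penalty'
print(round(bp * geo_mean, 6))  # 0.57893, matching 'bleu'
```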

Understanding BLEU Score Components

The BLEU output provides several important metrics:

  • bleu: the final BLEU score (0 to 1)

  • precisions: precision scores for 1-grams, 2-grams, 3-grams, and 4-grams

  • brevity_penalty: the penalty applied for shorter translations

  • length_ratio: the ratio of translation length to reference length

Conclusion

The BLEU score is a useful tool to evaluate translation quality by comparing n-gram precision against human references. While it provides a quick assessment, BLEU has limitations: it focuses on n-gram matching and ignores language nuances like grammar and coherence.

Updated on: 2026-03-27T11:19:31+05:30
