Precautions | Anuvadika

Precautions

Machine translation systems are software applications designed to translate speech or text from one language to another. The translation is produced by a model that has been trained to translate between several languages. Such systems rely on computational and linguistic rules, statistics and heuristics drawn from various inputs, including a large corpus of parallel (sentence-aligned) translated text, and they operate without any human intervention.

Because the process is mainly machine learning, in which the system learns from a set of examples and generalises from them, there is always a chance of error. The error rate may differ from one language pair to another and from domain to domain. While a system may work very well for one language pair (for example, English to Hindi), for others its output may be very poor, verging on unacceptable.

Machine translation tools are evaluated using several computational metrics accepted by the NLP academic and research community. The tool hosted here has gone through the same kind of evaluation, and the results reported for the engine deployed here are given in the table below.

              En-Indic                    Indic-En
LANGUAGES     BLEU     chrF++   COMET     BLEU     chrF++   COMET
Bengali       12.6     42.4     87.9      25.7     52.5     87.1
Gujarati      23.9     52.1     92.1      38.8     62.9     90.8
Hindi         37.6     59.7     86.1      41.6     65.1     90.7
Kannada       16.7     50.9     90.1      36.3     60.3     88.7
Malayalam     13.7     49.2     90.9      33.6     58.3     89.3
Marathi       18.1     49       80.9      32.2     57.1     88.1
Oriya         13.6     44.2     83.5      32.7     56.8     88.4
Punjabi       31.1     54.2     88.9      41.5     64.8     90.2
Tamil         12.3     47.5     91.9      31.1     55.6     87.4
Telugu        12       45.3     86.4      34.4     59.6     88.7
AVERAGE       19.16    49.45    87.87     34.79    59.3     88.94

The table above shows the evaluation scores for our IndicTrans2 engine on the BLEU, chrF++ and COMET metrics. The scores were calculated against the WAT 2021 benchmark, which contains datasets for Indian languages. All scores are reported on a scale of 0 to 100 and are given for both directions: English to Indian languages (En-Indic) and Indian languages to English (Indic-En). The AVERAGE row is the mean over the ten languages listed. For more information on these metrics, refer to the descriptions below.
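As a small illustration of how the AVERAGE row relates to the per-language rows, the sketch below recomputes one column as an unweighted mean. The function name `column_average` is hypothetical and used only for this example; the values are copied from the En-Indic BLEU column of the table above.

```python
def column_average(scores):
    """Unweighted mean of one metric column (one score per language)."""
    return sum(scores) / len(scores)


# En-Indic BLEU scores in table order: Bengali, Gujarati, Hindi, Kannada,
# Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.
en_indic_bleu = [12.6, 23.9, 37.6, 16.7, 13.7, 18.1, 13.6, 31.1, 12.3, 12.0]

average = column_average(en_indic_bleu)
print(round(average, 2))
```

Because the mean is unweighted, a language with little test data counts as much as one with a lot, so the average is best read as a rough summary rather than a per-sentence score.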

BLEU

BLEU stands for Bilingual Evaluation Understudy. It measures how good a machine-translated text is compared to a human translation. Kishore Papineni and his team at IBM developed BLEU in 2002. Since then, it has become one of the most popular methods for judging the quality of machine translations.

BLEU gives a quick and objective way to compare different machine translation systems. Before BLEU, people had to read and compare translations by hand, which was slow and often inconsistent.

  1. Checking Matches: BLEU looks at short word sequences (n-grams) in the machine’s translation and checks how many of them also appear in the human reference translations. The more matches it finds, the better the translation. (Papineni et al., 2002)
  2. Considering Length: BLEU also checks the length of the translation. If the machine translation is shorter than the reference, it gets a penalty (the brevity penalty) to prevent it from scoring unfairly high when it might have left out important information. (Papineni et al., 2002)
  3. Final Score: The raw score lies between 0 and 1; the table above reports it multiplied by 100. A score of 1 means the translation matches the reference exactly, and a score closer to 0 means it needs improvement. Since even good translations rarely match a reference word for word, scores are usually well below 1. (Papineni et al., 2002)
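The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring code used for the table: it assumes a single reference, whitespace tokenisation, no smoothing, and a sentence-level brevity penalty, whereas standard tools compute BLEU over a whole test set. The names `ngrams` and `sentence_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_bleu(hypothesis, reference, max_n=4):
    """Toy single-reference BLEU on whitespace tokens, without smoothing."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Step 1: clipped matches. An n-gram is credited at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matches == 0 or total == 0:
            # Without smoothing, one n-gram order with no matches gives 0.
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Step 2: brevity penalty for hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    # Step 3: geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


full = sentence_bleu("the cat sat on the mat", "the cat sat on the mat")
short = sentence_bleu("the cat sat on", "the cat sat on the mat")
print(full, short)
```

In the second call every n-gram of the hypothesis appears in the reference, so the precisions are all 1 and the lower score comes entirely from the brevity penalty.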

Limitations of BLEU

chrF++

chrF++ evaluates machine translation quality by looking at character-level matches. Unlike BLEU, which mainly focuses on word-level matches, chrF++ considers finer details by looking at sequences of characters (n-grams). This makes chrF++ especially useful for languages with complex word forms or translations where small differences matter.

  1. Character n-grams: chrF++ breaks translations into sequences of characters. For example, in the word "translation", the character 3-grams would be "tra", "ran", "ans", and so on. This captures subtle differences in spelling and word forms that word-level metrics might miss. (Popović, 2017)
  2. Word n-grams: Besides character n-grams, chrF++ also looks at word-level n-grams. This dual approach balances detailed character-level analysis with the broader context of word-level analysis.
  3. Precision and Recall: chrF++ calculates precision (how many n-grams from the machine translation are in the reference translation) and recall (how many n-grams from the reference translation are in the machine translation). This provides a balanced evaluation. (Popović, 2017)
  4. F-score: The final score is an F-score, a weighted harmonic mean of precision and recall. With the usual setting (β = 2), recall is weighted more heavily than precision. The score ranges from 0 to 100, where 100 means a perfect match between the machine and reference translations.
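The four steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference implementation: it assumes character n-grams up to length 6 taken after removing spaces, word n-grams up to length 2 on whitespace tokens, β = 2, and a plain average of precision and recall over all n-gram orders. Real tools differ in details such as how the orders are averaged. The names `ngram_counts` and `chrf_pp` are hypothetical.

```python
from collections import Counter


def ngram_counts(items, n):
    """Count contiguous n-grams in a sequence of characters or words."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def chrf_pp(hypothesis, reference, char_order=6, word_order=2, beta=2.0):
    """Toy chrF++: averaged n-gram precision and recall combined by F-beta."""
    precisions, recalls = [], []
    views = [
        # Character view (assumption: spaces are dropped before counting).
        (list(hypothesis.replace(" ", "")), list(reference.replace(" ", "")), char_order),
        # Word view on whitespace tokens.
        (hypothesis.split(), reference.split(), word_order),
    ]
    for hyp_items, ref_items, max_n in views:
        for n in range(1, max_n + 1):
            hyp = ngram_counts(hyp_items, n)
            ref = ngram_counts(ref_items, n)
            if not hyp or not ref:
                continue  # this order is longer than one of the texts
            overlap = sum((hyp & ref).values())  # clipped matching n-grams
            precisions.append(overlap / sum(hyp.values()))
            recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(chrf_pp("the cat", "the cat sat"))
```

Because β is greater than 1, a hypothesis that omits part of the reference (high precision, low recall) is penalised more than one that adds a little extra material.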

Limitations of chrF++

COMET

COMET (Cross-lingual Optimised Metric for Evaluation of Translation) uses machine learning to assess the quality of machine translations. Unlike traditional metrics like BLEU and chrF++, which rely on n-gram matching, COMET uses pre-trained language models to better understand and evaluate translations based on meaning and context.

  1. Pre-trained Language Models: COMET uses multilingual models in the BERT family (Bidirectional Encoder Representations from Transformers), such as XLM-RoBERTa, to encode the source text, the machine-translated text and the reference translation into high-dimensional vectors. These vectors capture semantic information and context beyond just word matches. (Rei et al., 2020)
  2. Cross-Lingual Features: COMET uses models trained on multiple languages, which helps it handle different linguistic structures and nuances better. This makes it robust for evaluating translations across various language pairs.
  3. Training with Human Judgments: COMET is trained on datasets with human judgement scores for translations. This training helps the model learn what humans consider high-quality translations, making its evaluations more accurate. (Rei et al., 2022)
  4. Scoring: The final score from COMET reflects how well the machine translation matches the meaning and context of the reference translation. Scores typically range from 0 to 1, with higher scores indicating better-quality translations; the table above reports them on a scale of 0 to 100. (Rei et al., 2020)
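To make the idea of comparing sentence vectors more concrete, the sketch below builds the kind of combined feature vector that the estimator model in Rei et al. (2020) passes to its scoring layers: the hypothesis and reference embeddings, their element-wise products with the source and reference, and the absolute differences. It is only an illustration under assumptions: the tiny vectors are made up, the function name `comet_style_features` is hypothetical, and the real system obtains its embeddings from a pre-trained encoder and feeds the features to a trained feed-forward network, neither of which is shown here.

```python
def comet_style_features(h, s, r):
    """Combine hypothesis (h), source (s) and reference (r) embeddings.

    Returns the concatenation [h; r; h*s; h*r; |h-s|; |h-r|], where * and
    |.| are applied element by element.
    """
    if not (len(h) == len(s) == len(r)):
        raise ValueError("all embeddings must have the same dimension")
    prod_hs = [a * b for a, b in zip(h, s)]
    prod_hr = [a * b for a, b in zip(h, r)]
    diff_hs = [abs(a - b) for a, b in zip(h, s)]
    diff_hr = [abs(a - b) for a, b in zip(h, r)]
    return list(h) + list(r) + prod_hs + prod_hr + diff_hs + diff_hr


# Made-up 2-dimensional "embeddings", for illustration only.
hypothesis_vec = [1.0, 2.0]
source_vec = [1.0, 0.0]
reference_vec = [0.5, 2.0]

features = comet_style_features(hypothesis_vec, source_vec, reference_vec)
print(len(features))
```

The point of the design is that the score depends on how the hypothesis vector relates to both the source and the reference, not on counting shared words.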

Limitations of COMET

References