Precautions

Machine Translation systems are software applications designed to translate speech or text from one language to another. This is done using a machine which has been trained to translate several languages from and to each other. They work on computational and linguistic sets of rules, statistincs and heuristics as drawn from various sets of inputs, including a large set of parallely aligned translated text corpus, without any human intervention.

As the process involved is mainly machine learning for a set of examples and deducing from it, there are always a chance of error. The error percentage may differ from one language pair to another and from domain to domain. While a given language pair may work very well for one language pair (for example English to Hindi), for others it may be very bad, verging to being simply non-acceptable.

Machine translation tools are evaluated based on several computational metrics as accepted by the NLP academic and research community. The current tool hosted here also has gone through the same kind of evaluation and the results of evaluation as reported for the engine deployed here is given in the table below.

	En-Indic			Indic-En
LANGUAGES	BLEU	chrF++	COMET	BLEU	chrF++	COMET
Bengali	12.6	42.4	87.9	25.7	52.5	87.1
Gujarati	23.9	52.1	92.1	38.8	62.9	90.8
Hindi	37.6	59.7	86.1	41.6	65.1	90.7
Kannada	16.7	50.9	90.1	36.3	60.3	88.7
Malayalam	13.7	49.2	90.9	33.6	58.3	89.3
Marathi	18.1	49	80.9	32.2	57.1	88.1
Oriya	13.6	44.2	83.5	32.7	56.8	88.4
Punjabi	31.1	54.2	88.9	41.5	64.8	90.2
Tamil	12.3	47.5	91.9	31.1	55.6	87.4
Telugu	12	45.3	86.4	34.4	59.6	88.7
AVERAGE	19.16	49.45	87.87	34.79	59.3	88.94

The table above shows the evaluation scores for our IndicTrans2 software calculated by the metrics BLEU, chrF++ and COMET. The scores have been calculated against the WAT 2021 Benchmark containing datasets for Indian languages. The scores are out of 100 and are given for translations from English to Indian languages and back translations (i.e. from Indian languages to English). For more information on these metrics, refer to the descriptions below.

BLEU

BLEU stands for Bilingual Evaluation Understudy. It measures how good a machine-translated text is compared to a human translation. Kishore Papineni and his team at IBM developed BLEU in 2002. Since then, it has become one of the most popular methods for judging the quality of machine translations.

BLEU gives a quick and objective way to see how good different machine translation systems are. Before BLEU, people had to read and compare translations by hand, which took time and could have been more consistent.

Checking Matches: BLEU looks at smaller parts (n-grams) of the machine’s translation and checks how many appear in the human translations. The more matches it finds, the better the translation. (Papineni et al., 2002)
Considering Length: BLEU also checks the length of the translation. If the machine translation is shorter, it gets a penalty to prevent it from scoring too high unfairly because it might have left out important information. (Papineni et al., 2002)
Final Score: The score is between 0 and 1. A score of 1 means the translation is perfect, and a score closer to 0 means it needs improvement. Since no translation is perfect, scores are usually below 1. (Papineni et al., 2002)

Limitations of BLEU

Though the model provides insight into the macro-level performance of the translation tool, it is not very suitable for fine-grained and micro-level evaluations (Gala et al., 2023).
Preprocessing cleans the input text before translation to remove white spaces and junk and standardise the input format. The BLEU score differs for differently preprocessed texts and is incomparable (Post, 2018).
The BLEU score relies on how words are split, and different ways of splitting words can lead to different scores. Thus, this drawback highlights the need for a more standardised and robust method of splitting words (Gala et al., 2023).
The BLEU score increases with the number of reference translations provided. Hence, one must be cautious and refrain from giving rough reference translations (Papineni et al., 2002).
The BLEU score does not measure meaning but pays more attention to word combinations, leading to disregarding semantic meaning (Tatman, 2019).
The BLEU score also heavily penalises synonyms if they are not listed in the reference translations or are treated as unknown if they do not occur at least twice in the test set (Post, 2018). (Tatman, 2019).
The BLEU score also disregards the syntactic structure and the ordering of the word combinations. This sometimes leads to good and bad translations getting the same BLEU score (Tatman, 2019).
For Indian languages that are agglutinative and morphologically rich, BLEU can penalise the morphological units, such as suffixes, if they do not match with the reference translation even though the meaning is conveyed correctly. Thus, changes should be made to store a repository of morphemes to ensure an accurate BLEU score (Tatman, 2019).
The researchers supplying the datasets often do not label which version they have used and the information that is added or lost. This leads to discrepancies or irregularities and contributes to a variance in the BLEU scores (Post, 2018).

chrF++

chrF++ evaluates machine translation quality by looking at character-level matches. Unlike BLEU, which mainly focuses on word-level matches, chrF++ considers finer details by looking at sequences of characters (n-grams). This makes chrF++ especially useful for languages with complex word forms or translations where small differences matter.

Character n-grams: chrF++ breaks translations into sequences of characters. For example, in the word "translation," character 3-grams would be "tra," "ran," "ans," and so on. This captures subtle differences in spelling and word forms that word-level metrics might miss. (Popović, 2017).
Word n-grams: Besides character n-grams, chrF++ also looks at word-level n-grams. This dual approach balances detailed character-level analysis with the broader context of word-level analysis.
Precision and Recall: chrF++ calculates precision (how many n-grams from the machine translation are in the reference translation) and recall (how many n-grams from the reference translation are in the machine translation). This provides a balanced evaluation. (Popović, 2017).
F++-score: The final score is the harmonic mean of precision and recall. This score ranges from 0 to 100, where 100 means a perfect match between the machine and reference translations.

Limitations of chrF++

The chrF scores are more accurate and closer to human ratings for systems with high translation quality (Popović, 2017).
The chrF++ scores can sometimes rate nonsensical translations as precise or closer to human translations if the combinations of two words are present in the reference translations(Popović, 2017).
Like the BLEU score, the chrF++ also focuses on word combinations to rate the translation (Popović, 2017).; this disregards the meaning and contextual information that needs to be understood for the translation to be evaluated.
The lack of interpretability exists across metrics as it is not directly inferable from the scores, which are parameters in the translation that require improvement.
The chrF++ scores have not been exhaustively tested on Indian languages, which are exceptionally morphologically rich (Popović, 2017). This creates a concern regarding the reliability of the chrF++ score for translation software dealing with Indian languages.

COMET

Cross-lingual Optimised Metric for Evaluation of Translation uses machine learning to assess the quality of machine translations. Unlike traditional metrics like BLEU and chrF++, which rely on n-grams, COMET uses pre-trained language models to understand better and evaluate translations based on meaning and context.

Pre-trained Language Models: COMET uses models like BERT (Bidirectional Encoder Representations from Transformers) to encode the machine-translated text and the reference translation into high-dimensional vectors. These vectors capture semantic information and context beyond just word matches. (Rei et al., 2020)
Cross-Lingual Features: COMET uses models trained in multiple languages, which helps it handle different linguistic structures and nuances better. This makes it robust for evaluating translations across various language pairs.
Training with Human Judgments: COMET is trained on datasets with human judgement scores for translations. This training helps the model learn what humans consider high-quality translations, making its evaluations more accurate. (Rei et al., 2022).
Scoring: The final score from COMET reflects how well the machine translation matches the meaning and context of the reference translation. Scores typically range from 0 to 1, with higher scores indicating better-quality translations. (Rei et al., 2020)

Limitations of COMET

COMET was initially trained on datasets that did not include Indian languages (Rei et al., 2022). Thus, this can lead to a decrease in the overall performance of the metric.
Another challenge COMET faces is the time it takes to evaluate the accuracy of the translation software (Rei et al., 2022).
A standard must be developed to balance the source and reference text’s weight (Rei et al., 2022).
COMET score can sometimes overlook translation errors such as translating proper nouns, changing the sentence’s sentiment, and incorrectly translating numbers (Rei et al., 2022).
COMET metric is trained upon human analysis and annotation of the datasets, which can cause subjectivity and may differ across language pairs (Rei et al., 2022).
Since the COMET metric works on neural networking, which aims to mimic the human brain, it is very crucial to keep updating the datasets for a more accurate translation.

References

Gala, J., Chitale, P. A., AK, R., Gumma, V., Doddapaneni, S., Kumar, A., Nawale, J., Sujatha, A., Puduppully, R., Raghavan, V., Kumar, P., Khapra, M. M., Dabre, R., Nilekani Centre at AI4Bharat, Indian Institute of Technology Madras, Microsoft, EkStep Foundation, National Institute of Information and Communications Technology, Kyoto, Japan, Flipkart, & Institute for Infocomm Research (I2R), A*STAR, Singapore. (2023). IndicTrans2: Towards high-quality and accessible machine translation models for all 22 scheduled Indian languages. Transactions on Machine Learning Research. Retrieved from https://openreview.net/forum?id=vfT4YuzAYA
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. IBM T. J. Watson Research Center. Retrieved from https://aclanthology.org/P02-1040.pdf
Popović, M. (2017). CHRF++: Words helping character n-grams. In Proceedings of the Conference on Machine Translation (WMT), Volume 2: Shared Task Papers (pp. 612–618). https://doi.org/10.18653/v1/W17-4770
Post, M. (2018). A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation (WMT), Volume 1: Research Papers (pp. 186–191). Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-64019
Rei, R., Stewart, C., Farinha, A. C., & Lavie, A. (2020). COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 2685–2702). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.213
Rei, R., Gomes, J., Alves, D., Zerva, C., Farinha, A., Glushkova, T., Lavie, A., Coheur, L., & Martins, A. (2022). COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT) (pp. 578–585). Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics. Retrieved from https://aclanthology.org/2022.wmt-1.52
Tatman, R. (2019, December 7). Evaluating text output in NLP: BLEU at your own risk. Medium. https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213