
About Us

Anuvadika is the CIIL machine translation tool that integrates translation models, transliteration engines, ASR engines and TTS engines on a single platform. Users can translate text from any Indian language (including English) to any other Indian language. The web application also supports transliteration of both the source and target text. For ease of access, we have also implemented a language detection module that identifies the language of the source text automatically; users can still choose the language they want to translate from.

In the sections below, we give a brief overview of the different modules that have been integrated into this platform.

Translation Engine

The LDC-IL Translation Tool is built on the latest Bhashini Models, using various kinds of parallel corpora and language corpora, including all the LDC-IL text corpora.

Specifically, this platform uses IndicTrans2 (Gala et al., 2023) from AI4Bharat, a research lab at IIT Madras, supported by TDIL, NLTM and LDC-IL.

The trained models from AI4Bharat are available at https://github.com/AI4Bharat/IndicTrans2. IndicTrans2 is a transformer-based multilingual NMT model that supports high-quality translation across all 22 scheduled Indian languages as well as English, including multiple scripts for a few Indian languages.

There are three 1.1B-parameter models (en-indic, indic-en and indic-indic), of which we use the distilled 600M-parameter variants in their optimized CTranslate2 format for faster inference.

Further, we internally use a unified API that routes each inference request to the appropriate model while facilitating dynamic batching for higher throughput.
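As a rough illustration of this routing, the sketch below loads the three distilled models with CTranslate2 and picks one based on the language pair. The model directory paths, the FLORES-style language codes and the batching parameters are assumptions for illustration only; tokenization with the IndicTrans2 preprocessor is omitted.

```python
# Minimal routing sketch (not the production service).
import ctranslate2

# Hypothetical local paths to the distilled models converted to CTranslate2 format.
MODEL_DIRS = {
    "en-indic":    "models/en-indic-ct2",
    "indic-en":    "models/indic-en-ct2",
    "indic-indic": "models/indic-indic-ct2",
}

# Load each translator once; CTranslate2 performs efficient batched decoding.
translators = {name: ctranslate2.Translator(path, device="auto")
               for name, path in MODEL_DIRS.items()}

def pick_model(src_lang: str, tgt_lang: str) -> str:
    """Route a language pair (FLORES-style codes) to the right model."""
    if src_lang == "eng_Latn":
        return "en-indic"
    if tgt_lang == "eng_Latn":
        return "indic-en"
    return "indic-indic"

def translate_batch(token_batches, src_lang, tgt_lang):
    """token_batches: lists of tokens produced by the IndicTrans2 preprocessor."""
    translator = translators[pick_model(src_lang, tgt_lang)]
    results = translator.translate_batch(token_batches, max_batch_size=32)
    return [r.hypotheses[0] for r in results]
```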

The following are the supported languages that can be used for n x n translations:

  • Assamese
  • Bengali
  • Bodo
  • Dogri
  • English
  • Gujarati
  • Hindi
  • Kannada
  • Kashmiri (Arabic)
  • Kashmiri (Devanagari)
  • Konkani
  • Maithili
  • Malayalam
  • Manipuri (Bengali)
  • Manipuri (Meetei-Mayek)
  • Marathi
  • Nepali
  • Odia
  • Punjabi
  • Sanskrit
  • Santali
  • Sindhi (Arabic)
  • Sindhi (Devanagari)
  • Tamil
  • Telugu
  • Urdu

If you use this tool for any research purpose, please acknowledge LDC-IL by citing this paper:

"Choudhary, N. 2021. LDC-IL: The Indian Repository of Resources for Language Technology. Language Resources & Evaluation. Springer, Vol. 55, Issue 1"

Text-based Language Detection Model

The translation tool also deploys a language detection engine to automatically detect the language of the given source text. For this purpose, we use the language identification model from Facebook AI named "lid218e" (Joulin et al., 2016). It was released as part of the NLLB project and can detect 217 languages. It is built on fastText, a library for efficient learning of word representations and sentence classification. Languages outside the list below are still detected, but they are classified as unknown and only their ISO language code is shown.
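For illustration, a minimal detection sketch is shown below, assuming the lid218e fastText checkpoint has been downloaded locally (the file name and the example sentence are placeholders):

```python
# Minimal fastText language-identification sketch using the NLLB lid218e model.
import fasttext

lid_model = fasttext.load_model("lid218e.bin")  # assumed local path to the checkpoint

def detect_language(text: str):
    """Return the top FLORES-style label (e.g. 'hin_Deva') and its probability."""
    labels, probs = lid_model.predict(text.replace("\n", " "), k=1)
    return labels[0].replace("__label__", ""), float(probs[0])

print(detect_language("यह एक परीक्षण वाक्य है।"))  # e.g. ('hin_Deva', 0.99)
```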

The following is the list of languages that the model can detect, occasionally with some errors:

  • Hindi
  • Urdu
  • Sindhi
  • Bengali
  • Punjabi
  • Gujarati
  • Telugu
  • Tamil
  • Nepali
  • Marathi
  • Kannada
  • Malayalam
  • Assamese
  • Odia
  • Maithili
  • Konkani
  • Magahi
  • Chhattisgarhi
  • Kashmiri (Devanagari)
  • Bhojpuri (Devanagari)
  • Manipuri (Bengali)
  • Santali
  • Awadhi
  • English

Transliteration Engines

The transliteration engine is a hybrid system. For the Brahmi-based scripts, we use a rule-based mechanism that provides a near one-to-one mapping of the Indic characters, which are all abugidas; this is our internal transliteration engine, available as Lipyantara (a sketch of this kind of mapping is given after the script list below).

However, for English and Urdu, the portal uses a statistical transliteration model by Jay Gala et al. (IIT-H) that aims to provide state-of-the-art cross-transliteration among all Indian languages, including English.

The following is the list of scripts supported through this mechanism:

  • Devanagari, Bangla/Assamese, Gujarati, Gurumukhi (Punjabi), Malayalam, Kannada, Tamil, Telugu, Oriya, Urdu, English
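The sketch below illustrates the idea behind the rule-based mapping for Brahmi-derived scripts: because the Indic Unicode blocks share a largely common (ISCII-derived) layout, most characters can be shifted from one block to another by a fixed offset. This is only an approximation of what Lipyantara does; real engines add exception tables for script-specific signs, and the script names used as keys here are illustrative.

```python
# Offset-based mapping between Brahmi-derived Unicode blocks (approximate sketch).
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali":    0x0980,
    "gurmukhi":   0x0A00,
    "gujarati":   0x0A80,
    "oriya":      0x0B00,
    "tamil":      0x0B80,
    "telugu":     0x0C00,
    "kannada":    0x0C80,
    "malayalam":  0x0D00,
}

def transliterate(text: str, src: str, tgt: str) -> str:
    """Shift each code point from the source script block into the target block."""
    src_start, tgt_start = BLOCK_START[src], BLOCK_START[tgt]
    out = []
    for ch in text:
        cp = ord(ch)
        if src_start <= cp < src_start + 0x80:   # character belongs to the source block
            out.append(chr(cp - src_start + tgt_start))
        else:                                    # digits, punctuation, spaces, etc.
            out.append(ch)
    return "".join(out)

print(transliterate("नमस्ते", "devanagari", "kannada"))  # ನಮಸ್ತೆ (approximate)
```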

ASR Engine

The ASR model is pooled in from Facebook/Meta's Massively Multilingual Speech (MMS) project (Pratap et al.). The architecture uses modular adapters that clip onto the main ASR backbone, giving more robust and accurate multilingual performance. Additionally, the inference pipeline allows the model to handle incoming audio requests longer than its 30-second limit, scaling inference to recordings of multiple hours. The model is hosted with Flash Attention 2 on a custom-written FastAPI inference pipeline with dynamic batching and a queue-based multiprocessing worker.
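As a rough sketch of such chunked inference (not the production FastAPI service), the example below runs the publicly released MMS checkpoint from Hugging Face over 30-second segments of a long recording; the checkpoint name, the Hindi adapter and the simple chunking strategy are assumptions for illustration.

```python
# Chunked ASR sketch with the MMS checkpoint; expects 16 kHz mono audio as a NumPy array.
import numpy as np
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

MODEL_ID = "facebook/mms-1b-all"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Swap in the per-language adapter (here Hindi, ISO code "hin").
processor.tokenizer.set_target_lang("hin")
model.load_adapter("hin")

def transcribe_long_audio(audio: np.ndarray, sr: int = 16000, chunk_s: int = 30) -> str:
    """Split long audio into fixed-size chunks, transcribe each, and stitch the text."""
    chunk_len = chunk_s * sr
    pieces = []
    for start in range(0, len(audio), chunk_len):
        chunk = audio[start:start + chunk_len]
        inputs = processor(chunk, sampling_rate=sr, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        ids = torch.argmax(logits, dim=-1)[0]
        pieces.append(processor.decode(ids))
    return " ".join(pieces)
```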

The ASR model supports inference for the following 20 Indian languages:

Assamese, Awadhi, Bengali, Bodo Parja, Haryanvi, Gujarati, Hindi, Kashmiri, Konkani, Maithili, Malayalam, Marathi, Meitei, Marwari, Sindhi, Tamil, Telugu, Urdu, English, Kannada


TTS Engine

    

For the TTS output, we use a multilingual, multi-speaker TTS model developed by the teams of Prof. Hema Murthy and Prof. S. Umesh at IIT Madras. It provides FastSpeech2 models for 14 Indian languages (both male and female voices) built using Hybrid Segmentation (HS) for speech synthesis. The model generates mel-spectrograms from text input, which are then used to synthesise speech.

The following languages and speaker genders are supported:
  • Hindi (Male, Female)
  • Malayalam (Male, Female)
  • Manipuri (Male)
  • Marathi (Male, Female)
  • Kannada (Male, Female)
  • Bodo (Female)
  • English (Male, Female)
  • Assamese (Male, Female)
  • Tamil (Male, Female)
  • Odia (Male, Female)
  • Rajasthani (Male, Female)
  • Telugu (Male, Female)
  • Bengali (Male, Female)
  • Gujarati (Male, Female)