A Corpus Based N-gram Hybrid Approach of Bengali to English Machine Translation

May 2, 2022 by Essay Writer

Abstract

Machine translation is automatic translation performed by computer software. Several approaches to machine translation exist, but some require extensive linguistic knowledge while others demand heavy statistical computation. This paper therefore introduces a hybrid methodology that integrates the corpus based approach and the statistical approach to translate Bengali sentences into English with the help of an N-gram language model. The corpus based approach finds the corresponding target translation by selecting the best-matching text from the bilingual corpus to acquire knowledge, while the n-gram model rearranges the sentence constituents to obtain an accurate translation without employing any external linguistic rules. Bengali sentences of various structures and verb tenses are translated. The performance of the proposed system is evaluated in terms of WER, BLEU and F-measure and compared with conventional single-method approaches as well as Google Translate, a well-known machine translation service by Google. The experimental results show that this work achieves a BLEU score of 0.87, higher than Google Translate and the other methods.

Introduction

Machine Translation, abbreviated as MT, is the application of computers to automate some or all of the process of transforming text between any pair of natural human languages while preserving the meaning and interpretation of both source and target languages [1]. It lies at the origin of Natural Language Processing (NLP) and Computational Linguistics (CL). Although much research has been conducted in this area, producing a fully automated translation system remains a challenging job. Human languages are complex in practice and have varied characteristics. The major barriers to translating human languages by computer are: word order (different languages order sentence constituents differently); word sense ambiguity (the same words and phrases can have different meanings); syntactic complexity (sentences are often governed by irregular grammar rules); lexical variance (a word in one language may have to be expressed by a group of words in another); and elliptical and ungrammatical sentence constructions. Research to overcome these difficulties is ongoing.

At present, different methods are used for machine translation, such as the direct, transfer, interlingua, corpus based and statistical approaches. In this paper, a new approach is proposed for automatic Bengali to English translation. The new method blends the corpus based approach with the n-gram language model of IBM.

The rest of the paper is organized as follows: Section II reviews previous research on this topic. Some core machine translation approaches are discussed in Section III. Section IV describes the proposed hybrid approach with complexity analysis, and Section V presents the experimental results, including the corpus and a comparative study. Finally, Section VI concludes the paper with some future directions.

Related research

Bengali, also known by its endonym Bangla (বাংলা), is the sixth most spoken language in the world by population, with approximately 250 million speakers worldwide. Unfortunately, only a few research works have explored machine translation software that uses Bengali as the source language and English as the target language. Most Bengali MT systems address English to Bengali translation, for which several rule based and statistical approaches have been explored. Moreover, researchers have been more concerned with verb tenses than with sentence types (simple, complex, compound). This research takes that fact into consideration and works on both sentence types and tense.

One study introduces new parameters of statistical machine translation (SMT) alongside the existing parameters to translate complex Bengali sentences. Another takes a rule based approach that considers the influence of verb and case in Bengali assertive and interrogative sentences. A later work develops a transfer based algorithm that accounts for the meaning and context of Bengali sentences. A further framework is designed using context sensitive grammar rules, and another empirical framework is modeled to translate imperative, optative and exclamatory Bengali texts. Meanwhile, some researchers have built systems that translate Bengali into languages other than English: one architecture integrates the transfer method and statistical machine translation for Bengali to Hindi translation, and another system translates Bengali texts into Assamese using Moses (a tool for MT).

Core Machine Translation Approaches

The ideas and techniques of machine translation draw on linguistics, computer science, artificial intelligence, automata theory, translation theory and statistics. Different approaches are applied to automate translation. Some of the core methods are discussed in this section.

MT methods can be classified into two main categories: 1) Rule based and 2) Example based. Rule based MT (RBMT) strategies are essentially knowledge based techniques. Linguistic knowledge in the form of rules is applied externally, separating sentences into possible linguistic units for both the source and the target language. RBMT methodologies require syntactic, semantic and morphological analysis in the context of grammar and lexicon. These approaches are: i) Direct, ii) Transfer and iii) Interlingua approach. Example based approaches are mostly data driven and analogy based, where a set of translated texts has already been stored in a bilingual database. Methods of this kind are: i) Corpus based and ii) Statistical machine translation.

Direct Approach

The most primitive MT method is direct translation, which is implemented between pairs of languages and is based on morphological analysis and glossaries. It relies heavily on dictionary look-up. In the direct translation approach, the source language (SL) text is analyzed morphologically with respect to the specific source and target language pair. The direct approach translates in five steps:

Source sentence: তারা ফুটবল খেলছে।

Morphological Analysis:

তারা ফুটবল খেলছে PRESENT CONTINUOUS

Constituent Identification:

Reorder:

Dictionary Look up:

Inflect:

They are playing football

Target sentence: They are playing football.
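The five steps can be sketched as a toy pipeline. This is only an illustration under stated assumptions, not the implementation of any real system: the three-word glossary, the suffix table, the fixed subject-object-verb role assignment and the single inflection rule are all hypothetical, and the intermediate outputs are assumed because the example above shows only the first and last steps.

```python
# Toy sketch of the five direct-translation steps for one sentence.
# The glossary, suffix table and role assignment are illustrative assumptions.

# Hypothetical glossary: Bengali stem -> English base form.
LEXICON = {"তারা": "they", "ফুটবল": "football", "খেল": "play"}

# Hypothetical suffix table: verb ending -> tense label.
TENSE_SUFFIXES = {"ছে": "PRESENT CONTINUOUS"}


def morphological_analysis(tokens):
    """Step 1: split a known tense suffix off the verb; return (stems, tense)."""
    stems, tense = [], None
    for tok in tokens:
        for suffix, label in TENSE_SUFFIXES.items():
            if tok.endswith(suffix) and tok[: -len(suffix)] in LEXICON:
                tok, tense = tok[: -len(suffix)], label
                break
        stems.append(tok)
    return stems, tense


def identify_constituents(stems):
    """Step 2: assume a simple three-word clause in subject-object-verb order."""
    subject, obj, verb = stems
    return {"SUB": subject, "OBJ": obj, "VERB": verb}


def reorder(constituents):
    """Step 3: rearrange into English subject-verb-object order."""
    return [constituents[role] for role in ("SUB", "VERB", "OBJ")]


def dictionary_lookup(stems):
    """Step 4: replace each stem with its glossary entry."""
    return [LEXICON[stem] for stem in stems]


def inflect(words, tense):
    """Step 5: apply English inflection (one rule, plural subject assumed)."""
    subject, verb, obj = words
    if tense == "PRESENT CONTINUOUS":
        verb = "are " + verb + "ing"
    return f"{subject.capitalize()} {verb} {obj}."


def direct_translate(sentence):
    tokens = sentence.rstrip("।").split()
    stems, tense = morphological_analysis(tokens)
    ordered = reorder(identify_constituents(stems))
    return inflect(dictionary_lookup(ordered), tense)


print(direct_translate("তারা ফুটবল খেলছে।"))
```

Every step here is a table look-up or a fixed rearrangement, which is why the approach depends so strongly on the dictionary and on the particular language pair.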

Transfer Approach

This approach performs the translation task by considering the structural differences between the source and target languages. It requires knowledge of the syntactic structures of both languages.

The transfer model involves three stages: i) Analysis, ii) Transfer and iii) Generation. In the first stage, the source sentence is parsed and the sentence structure and constituents are identified. In the next stage, transformations are applied to the source language parse tree to convert its structure to that of the target language. Finally, the translation is generated on the basis of the morphology of the target language. In other words, this method can be summarized as: first parse, then reorder, finally translate. Figure 1 shows an illustration of this approach.

Source sentence: আমরা ফুটবল খেলি।

Analysis:

Sentence: আমরা [SUB] + ফুটবল [OBJ] + খেলি [VERB]

Transfer:

Sentence: SUB + OBJ + VERB → SUB + VERB + OBJ

Generation:

Sentence: আমরা → We + খেলি → play + ফুটবল → football.

Target sentence: We play football.

Figure 1. Transfer approach (Step 1: Analysis, Step 2: Transfer, Step 3: Generation).
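The three stages can be sketched in a few lines of code. This is a minimal illustration, assuming a hypothetical role lexicon in place of a real parser, a single structural transfer rule, and a three-word dictionary; none of these tables come from the paper.

```python
# Toy sketch of the analysis -> transfer -> generation stages.
# ROLES, TRANSFER_RULES and DICTIONARY are hypothetical illustration data.

ROLES = {"আমরা": "SUB", "ফুটবল": "OBJ", "খেলি": "VERB"}
DICTIONARY = {"আমরা": "we", "ফুটবল": "football", "খেলি": "play"}

# Structural rule: Bengali subject-object-verb -> English subject-verb-object.
TRANSFER_RULES = {("SUB", "OBJ", "VERB"): ("SUB", "VERB", "OBJ")}


def analyze(sentence):
    """Stage 1: tag each source word with its constituent role."""
    words = sentence.rstrip("।").split()
    return [(ROLES[word], word) for word in words]


def transfer(parse):
    """Stage 2: map the source structure onto the target structure."""
    source_pattern = tuple(role for role, _ in parse)
    target_pattern = TRANSFER_RULES[source_pattern]
    by_role = dict(parse)
    return [(role, by_role[role]) for role in target_pattern]


def generate(parse):
    """Stage 3: replace each source word with its target word."""
    words = [DICTIONARY[word] for _, word in parse]
    return " ".join(words).capitalize() + "."


parse = analyze("আমরা ফুটবল খেলি।")
print(generate(transfer(parse)))
```

Keeping the reordering in a separate rule table mirrors the point of the approach: the structural difference between the two languages is stated once, independently of the words involved.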

Interlingua Approach

The interlingua approach relies on a language-neutral analysis of the text. In this approach, the translation task consists of two phases. First, the Source Language (SL) text is converted into an intermediary form called Interlingua (IL), and then the IL drives the generation of text in the Target Language (TL). The IL provides an independent underlying representation from which translations can be generated into different TLs.

Source sentence: আমরা কলম দিয়ে লেখি।

Analysis:

Interlingua (IL) Representation:

AGENT we

ACTION write

INSTRUMENT pen number: singular

TENSE present

Synthesis:

We write (with) pen.

Target sentence: We write with pen.
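The two phases can be sketched as follows. This is a minimal illustration, assuming a hypothetical concept lexicon, a fixed word order (agent, noun, instrument marker, verb) and a hard-coded present tense; the frame holds only the roles listed in the example above. Input and lexicon keys are Unicode-normalized so that differently encoded Bengali letters still compare equal.

```python
import unicodedata
from dataclasses import dataclass


def nfc(text):
    """Normalize so that equivalent Bengali encodings compare equal."""
    return unicodedata.normalize("NFC", text)


# Hypothetical concept lexicon: Bengali word -> language-neutral concept.
CONCEPTS = {nfc(k): v for k, v in {"আমরা": "we", "কলম": "pen", "লেখি": "write"}.items()}

# Postposition that marks the preceding noun as the instrument.
INSTRUMENT_MARKER = nfc("দিয়ে")


@dataclass
class Interlingua:
    agent: str
    action: str
    instrument: str
    tense: str


def analyze(sentence):
    """Phase 1: source sentence -> IL frame (fixed word order assumed)."""
    tokens = nfc(sentence).rstrip("।").split()
    marker_at = tokens.index(INSTRUMENT_MARKER)
    return Interlingua(
        agent=CONCEPTS[tokens[0]],
        action=CONCEPTS[tokens[-1]],
        instrument=CONCEPTS[tokens[marker_at - 1]],
        tense="present",  # assumption: tense is not actually detected here
    )


def synthesize_english(il):
    """Phase 2: IL frame -> English, reading only the frame, never the source."""
    return f"{il.agent.capitalize()} {il.action} with {il.instrument}."


frame = analyze("আমরা কলম দিয়ে লেখি।")
print(frame)
print(synthesize_english(frame))
```

Because `synthesize_english` never sees the Bengali input, a generator for another target language could be written against the same frame without touching the analysis phase.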

Corpus based Approach

The corpus-based machine translation (CBMT) approach is characterized by the use of a bilingual corpus at run time instead of human encoded linguistic knowledge. Previously translated texts are stored in a parallel training corpus, and new sentences to be translated are treated as the test set. This approach removes the need for prior translation rules and instead reuses the examples to create knowledge.

The method first decomposes the source sentence into fragments, finds a translation for each fragment in the parallel corpus, and then recomposes the translated fragments. The main advantage of CBMT is that it can be applied to any language pair that has a parallel corpus; the only linguistic knowledge required is how to split text into sentences.

Source sentence: তারা বাগানে কাজ করছে।

Bilingual Corpus:

তারাপড়ছে — They arereading.

আমরাখামারেকাজ করছি — We arein the farmworking.

তুমিবাগানেখেলছ — You arein the gardenplaying.

Target sentence: They are working in the garden.
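The decomposition and look-up steps can be sketched as below. This is a minimal illustration under stated assumptions: the fragment table is read off the three corpus examples above, the greedy longest-match strategy and the similarity cutoff of 0.8 are hypothetical choices, and fuzzy matching from Python's standard `difflib` stands in for whatever best-match criterion a real system would use. The sketch stops before recomposition, so the fragments come back in source order.

```python
import difflib

# Aligned fragments read off the three corpus examples (source -> target).
FRAGMENTS = {
    "তারা": "They are",
    "পড়ছে": "reading",
    "আমরা": "We are",
    "খামারে": "in the farm",
    "কাজ করছি": "working",
    "তুমি": "You are",
    "বাগানে": "in the garden",
    "খেলছ": "playing",
}


def best_match(fragment, cutoff=0.8):
    """Return the closest corpus fragment, or None if nothing is similar enough."""
    hits = difflib.get_close_matches(fragment, list(FRAGMENTS), n=1, cutoff=cutoff)
    return hits[0] if hits else None


def translate_fragments(sentence):
    """Greedy longest-match decomposition followed by fragment-wise look-up."""
    tokens = sentence.rstrip("।").split()
    output, i = [], 0
    while i < len(tokens):
        # Try the longest remaining span first, then shrink it word by word.
        for j in range(len(tokens), i, -1):
            match = best_match(" ".join(tokens[i:j]))
            if match is not None:
                output.append(FRAGMENTS[match])
                i = j
                break
        else:
            raise KeyError(f"no corpus fragment covers {tokens[i]!r}")
    return output


print(translate_fragments("তারা বাগানে কাজ করছে।"))
```

The verb fragment কাজ করছে does not occur in the corpus verbatim; it is matched to the nearest stored fragment, কাজ করছি, which is the sense in which the approach selects a best match rather than an exact one. Reordering the translated fragments into "They are working in the garden." is a separate step.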

Statistical Machine Translation

Statistical machine translation (SMT) is a data-oriented empirical translation framework based on probability distributions. It finds the most likely translation among all possible target sentences by choosing the one with the highest probability, using Eq. (1).

$$
\hat{t}_1^I = \arg\max_{t_1^I} \left\{ \Pr(t_1^I \mid s_1^J) \right\} = \arg\max_{t_1^I} \left\{ \Pr(t_1^I) \cdot \Pr(s_1^J \mid t_1^I) \right\} \tag{1}
$$

Here a source sentence $s_1^J = s_1 \dots s_J$ is to be translated into a target sentence $t_1^I = t_1 \dots t_I$, where $J$ and $I$ are the numbers of words in the source and target sentences, respectively. The second form of Eq. (1) follows from Bayes' rule, because $\Pr(s_1^J)$ does not depend on the target sentence and can be dropped from the maximization. The argmax operation denotes the search that generates the output sentence, $\Pr(t_1^I)$ is the language model of the target language and $\Pr(s_1^J \mid t_1^I)$ is the translation model. The translation model, estimated from a bilingual corpus, assigns higher probability to target sentences that correspond to the source, while the language model, estimated from a monolingual corpus, favors fluent, grammatically correct sentences. SMT also requires search techniques and word alignments to produce the output.
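The decision rule of Eq. (1) can be illustrated with a toy decoder. This is only a sketch: the three candidate sentences and every probability value are invented for illustration, whereas a real SMT system would estimate the two models from corpora and search a far larger candidate space.

```python
import math

# Hypothetical model tables; all numbers are made up for illustration only.

# Language model Pr(t): how fluent each candidate target sentence is.
LANGUAGE_MODEL = {
    "we play football": 0.6,
    "we football play": 0.1,
    "we play pen": 0.3,
}

# Translation model Pr(s | t) for the fixed source sentence "আমরা ফুটবল খেলি".
TRANSLATION_MODEL = {
    "we play football": 0.5,
    "we football play": 0.5,
    "we play pen": 0.01,
}


def score(candidate):
    """log Pr(t) + log Pr(s | t); the log keeps products of small numbers stable."""
    return math.log(LANGUAGE_MODEL[candidate]) + math.log(TRANSLATION_MODEL[candidate])


def decode(candidates):
    """Return the candidate that maximizes Pr(t) * Pr(s | t), as in Eq. (1)."""
    return max(candidates, key=score)


print(decode(list(LANGUAGE_MODEL)))
```

In this toy table the translation model cannot separate the two word orders, and the language model breaks the tie in favor of the fluent one, which is the division of labor described above.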
