Context

 

Statistical machine translation (SMT) addresses the translation of text from one natural language into another using statistical models.

Traditionally the problem is formulated as finding the most probable English sentence e given a foreign sentence f. The formalism follows from the seminal paper "The Mathematics of Statistical Machine Translation" (Brown et al., 1993):

ê = argmax_e P(e | f) = argmax_e P(e) · P(f | e) / P(f) = argmax_e P(e) · P(f | e)

The denominator P(f) is omitted since it is a normalization factor, constant over all candidate sentences e.

The first term, P(e), is called the language model; it measures how well formed the produced sentence is.

The second term, P(f | e), is called the translation model; it models the correspondence between the source and the target language.
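The decision rule above can be illustrated with a toy sketch in Python. The candidate sentences and all probabilities below are invented purely for illustration; a real system scores millions of hypotheses with learned models.

```python
# Toy, invented probabilities for illustration only.
language_model = {          # P(e): how fluent each candidate sentence is
    "the house is small": 0.20,
    "house the small is": 0.001,
    "the home is small": 0.10,
}
translation_model = {       # P(f | e) for the input f = "das Haus ist klein"
    "the house is small": 0.30,
    "house the small is": 0.30,
    "the home is small": 0.25,
}

def best_translation(candidates):
    """Noisy-channel decision rule: argmax_e P(e) * P(f | e)."""
    return max(candidates, key=lambda e: language_model[e] * translation_model[e])

print(best_translation(list(language_model)))  # -> the house is small
```

Note how the word-salad candidate scores as well under the translation model (it contains the right words) but is rejected by the language model, which is exactly the division of labour the two terms are meant to provide.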

 

Overview

 

In practice, the steps needed to build an SMT system are:

  1. Corpora gathering. This entails preprocessing of the corpus to get rid of unwanted characters (for example, if the source is HTML code). Tokenization is also needed to obtain a consistent representation of words throughout the corpus. For the bilingual corpus one more step is required, sentence alignment: finding, for each sentence in the native side of the corpus, its translation(s) in the foreign side.
  2. Corpora alignment. In this step, for each sentence pair, the correspondence at the word/phrase level is established.
  3. Translation/language model training. In this step the models are built, extracting information from the corpora in an unsupervised manner, using statistical methods.
  4. Decoding. Having built the necessary components, you are now ready to translate. Generally speaking, the decoder implements the equation above, using the trained models to translate new input.
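Step 1 can be sketched as follows. The tokenizer and the toy sentence pair below are simplified illustrations, not the preprocessing performed by any particular toolkit.

```python
import re

def tokenize(sentence):
    """Lowercase and split off punctuation so 'house.' and 'house' match."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower(), re.UNICODE)

# A toy sentence-aligned bilingual corpus: one (foreign, native) pair.
raw_pairs = [
    ("Das Haus ist klein.", "The house is small."),
]

corpus = [(tokenize(f), tokenize(e)) for f, e in raw_pairs]
print(corpus[0])
# -> (['das', 'haus', 'ist', 'klein', '.'], ['the', 'house', 'is', 'small', '.'])
```

Consistent tokenization matters because every later step counts word occurrences; if "house." and "house" are treated as distinct tokens, the statistics are fragmented.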

Corpora gathering is a resource-consuming process. Fortunately, bilingual corpora are available to the scientific community. A widely used corpus is Europarl, which consists of the proceedings of the European Parliament.

Corpora alignment is usually done using the GIZA++ toolkit.

In order to build a language model, several solutions exist; one example is SRILM.
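What an n-gram language model of the kind built with SRILM estimates can be sketched with bigram counts. The corpus below is a toy example, and real toolkits additionally apply smoothing (e.g. Kneser-Ney) to handle unseen n-grams.

```python
from collections import defaultdict

# Toy monolingual corpus with sentence-boundary markers.
sentences = [
    "<s> the house is small </s>",
    "<s> the house is big </s>",
    "<s> the book is small </s>",
]

bigram = defaultdict(float)
unigram = defaultdict(float)
for s in sentences:
    words = s.split()
    for w1, w2 in zip(words, words[1:]):
        bigram[(w1, w2)] += 1
        unigram[w1] += 1

def p(w2, w1):
    """Maximum-likelihood bigram probability P(w2 | w1)."""
    return bigram[(w1, w2)] / unigram[w1] if unigram[w1] else 0.0

print(p("house", "the"))  # "house" follows "the" in 2 of its 3 occurrences
```

A sentence is then scored as the product of its bigram probabilities, which is how the language model rewards well-formed word order during decoding.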

For training the translation model one can follow different approaches. However, a general framework is described at http://www.statmt.org/wmt08/baseline.html.
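At the core of phrase-based translation model training is a relative-frequency estimate over phrase pairs extracted from the word-aligned corpus. A toy sketch, assuming the phrase pairs (with invented counts) have already been extracted:

```python
from collections import defaultdict

# Toy extracted phrase pairs (invented, for illustration only).
extracted = [
    ("das Haus", "the house"),
    ("das Haus", "the house"),
    ("das Haus", "the building"),
    ("ist klein", "is small"),
]

pair_count = defaultdict(int)
src_count = defaultdict(int)
for f, e in extracted:
    pair_count[(f, e)] += 1
    src_count[f] += 1

def phi(e, f):
    """Relative-frequency phrase translation probability phi(e | f)."""
    return pair_count[(f, e)] / src_count[f]

print(phi("the house", "das Haus"))  # 2 of the 3 "das Haus" pairs
```

The resulting phrase table, together with the language model, provides the scores the decoder combines when searching for the best translation.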

As for the decoder, several solutions are available. In the past Pharaoh was widely used, while Moses is now gaining popularity, implementing more features and being open source. This list is by no means exhaustive; it is just meant to give an example tool for each part of the process. For instance, another open-source decoder is Phramer.

One can also try to incorporate linguistic information into the system. There has been a lot of research in this field, and many tools exist that can supply such information, for example part-of-speech tagging, shallow parsing, or morphological analysis. One can incorporate this information into the system in the hope of improving performance. Using syntax aims at producing more correct/fluent translations, while exploiting morphology can help when dealing with morphologically rich languages (such as Czech or German), especially when the amount of training data available is small. However, experience has shown that the best way to improve the performance of an SMT system is to use more data.

 

Contributors

Khalid Daoudi (contact)

Klasinas Ioannis (contact)