GIZA++ : the wordMint implementation

GIZA++ is a statistical machine translation toolkit for training IBM Models 1-5 and an HMM word alignment model. It aligns words and sequences of words in sentence-aligned corpora. If you have a parallel corpus, you can use GIZA++ to build bilingual dictionaries.

GIZA++ is an extension of the program GIZA (part of the SMT toolkit EGYPT), which was developed by the Statistical Machine Translation team during the 1999 summer workshop at the Center for Language and Speech Processing at Johns Hopkins University (CLSP/JHU). GIZA++ adds many features on top of GIZA. These extensions were designed and written by Franz Josef Och.

In wordMint, GIZA++ is used to obtain character-level alignments. GIZA++ is poorly documented, though its README file helps. Because GIZA++ is built to align words, getting character-level alignments out of it proved difficult. After exploring a workaround, we obtained alignments over character n-grams.

For this, each named-entity word was treated as a sentence, and the characters of the word as the tokens of that sentence. HMM and IBM Model 1, 3 and 4 iterations were run on the word-aligned bilingual parallel corpora. The input to GIZA++ was a pair of language files in the following format:
श क ु न ् त ल ा [Hindi Text File] < - > s h a k u n t l a [English Text File]
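
The preprocessing step described above can be sketched in Python as follows. This is only an illustrative sketch, not the actual wordMint script: the function name `to_char_tokens` is hypothetical, and it assumes that every Unicode code point (including Devanagari vowel signs and the virama) becomes its own token, as in the sample line above.

```python
def to_char_tokens(word):
    """Turn one named-entity word into a GIZA++-style 'sentence'.

    Each Unicode code point becomes one space-separated token, so
    Devanagari dependent signs are split off from their consonants.
    (Assumption: this mirrors the format shown in the sample above.)
    """
    return " ".join(word.strip())


if __name__ == "__main__":
    # One word per line in, one character-tokenised line out.
    for word in ["shakuntla", "\u0936\u0915\u0941\u0928\u094d\u0924\u0932\u093e"]:
        print(to_char_tokens(word))
```

Splitting by code point keeps the script trivial; grouping characters into larger n-gram units would need an additional step that is not shown here.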

GIZA++ Package Programs :

• GIZA++: GIZA++ itself
• plain2snt.out: simple tool to transform plain text into GIZA format
• snt2plain.out: simple tool to transform GIZA format into plain text
• trainGIZA++.sh: Shell script to perform standard training given a corpus in GIZA format
• mkcls: Computes word classes in a monolingual corpus
• snt2cooc.out: Generates a co-occurrence file

Run plain2snt.out, located within the GIZA++ package, to prepare the text files for GIZA++ input:
./plain2snt.out hindi english

Files created by plain2snt.out:
english.vcb
hindi.vcb
hindienglish.snt

english.vcb consists of:
each word from the English corpus
the corresponding frequency count for each word
a unique ID for each word

hindi.vcb consists of:
each word from the Hindi corpus
the corresponding frequency count for each word
a unique ID for each word

hindienglish.snt consists of:
each sentence pair from the parallel Hindi and English corpora, with every word replaced by its unique ID
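
To make the file contents concrete, here is a rough Python sketch of how such vocabulary and sentence files could be built from tokenised text. It is not the plain2snt.out source. The layout is an assumption based on the GIZA++ README: each .vcb line is taken to be `id token count`, IDs are assumed to start at 2, and each sentence pair in the .snt file is assumed to occupy three lines (pair count, source IDs, target IDs). All function names are hypothetical.

```python
from collections import Counter


def build_vocab(sentences, first_id=2):
    """Assign an ID to each token in order of first appearance and count it.

    Assumption: IDs start at 2, with lower IDs reserved by GIZA++.
    """
    counts = Counter(tok for sent in sentences for tok in sent)
    ids = {}
    for sent in sentences:
        for tok in sent:
            if tok not in ids:
                ids[tok] = first_id + len(ids)
    return ids, counts


def vcb_lines(ids, counts):
    """Render vocabulary entries as 'id token count' (assumed layout)."""
    return [f"{i} {tok} {counts[tok]}" for tok, i in ids.items()]


def snt_lines(src_sents, tgt_sents, src_ids, tgt_ids):
    """Render each sentence pair as three lines: count, source IDs, target IDs."""
    out = []
    for src, tgt in zip(src_sents, tgt_sents):
        out.append("1")  # each pair is counted once in this sketch
        out.append(" ".join(str(src_ids[tok]) for tok in src))
        out.append(" ".join(str(tgt_ids[tok]) for tok in tgt))
    return out


if __name__ == "__main__":
    english = [list("shakuntla")]
    hindi = [list("\u0936\u0915\u0941\u0928\u094d\u0924\u0932\u093e")]
    en_ids, en_counts = build_vocab(english)
    hi_ids, hi_counts = build_vocab(hindi)
    print("\n".join(vcb_lines(en_ids, en_counts)))
    print("\n".join(snt_lines(hindi, english, hi_ids, en_ids)))
```

Comparing the output of a sketch like this with the files that plain2snt.out actually writes is a quick way to check the assumed layout.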

Now run GIZA++, located within the GIZA++ package:
./GIZA++ -S hindi.vcb -T english.vcb -C hindienglish.snt

GIZA++ produces several files (a distortion probability model, a word alignment probability model, etc.), but only one of them matters here: the file with the extension *.A3.final. The *.A3.* files are GIZA++'s alignment files, and *.A3.final holds the alignment from the final training iteration. It gives the word-to-word alignment for every word on every line, in the same order as the input parallel corpus.
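
As an illustration of how the *.A3.final file can be read, the Python sketch below parses it into per-pair alignments. It assumes the usual three-line record layout (a `# Sentence pair ...` header, a line of tokens, then a line where each token is followed by `({ positions })` pointing into the previous line). The function names are hypothetical and this is not the wordMint parser.

```python
import re

# One aligned token looks like:  tok ({ 1 2 })   or   tok ({ })
ALIGNED_TOKEN = re.compile(r"(\S+) \(\{([\d ]*)\}\)")


def parse_alignment_line(line):
    """Return [(token, [1-based positions]), ...] for one alignment line."""
    return [
        (tok, [int(p) for p in positions.split()])
        for tok, positions in ALIGNED_TOKEN.findall(line)
    ]


def parse_a3(lines):
    """Yield (tokens, alignment) for each three-line record.

    Assumption: records are exactly header / token line / alignment line.
    """
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(lines) - 2, 3):
        _header, token_line, aligned_line = lines[i:i + 3]
        yield token_line.split(), parse_alignment_line(aligned_line)


if __name__ == "__main__":
    # Hand-written sample record, not real GIZA++ output.
    sample = [
        "# Sentence pair (1) source length 3 target length 3 alignment score : 0.01",
        "s h a",
        "NULL ({ }) x ({ 1 2 }) y ({ }) z ({ 3 })",
    ]
    for tokens, alignment in parse_a3(sample):
        print(tokens, alignment)
```

Positions are 1-based indices into the token line, and tokens listed after `NULL` are those left unaligned.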

Python scripts were written to preprocess the text into the required input format and to parse the base corpora.

