Algorithm : wordMint
Given is the algorithm for wordMint project. This is a flexible approach and we can accommodate some changes in the future.
Algorithm
1. User inputs the string to be processed/transliterated. This will be in roman script.
2. Now there are few possibilities:
i. The word entered is in English. This is an example of forward transliteration. We put the word into a language identification model (in broad sense, this term is used if the algorithm is capable of working in multiple languages. Here in our case, it will be primarily be Hindi and English. A dictionary check for English would reveal everything. Advanced methods are available in the case of multiple scripts/languages.)If the word is in English, CMU dict will be used (It lists all the English words will with their pronunciation in IPA). We can map IPA to Devanagari characters. This should not be a difficult task. These words are then considered for backward transliteration also. Move on to step 3 for that.
ii. The word entered could be in English, but the spelling may be incorrect. For example: “transliteration” or “traanslitareshan”
Here we can put a modified edit distance to check if the word entered relates to some English word. If the score is pretty good, then we can consider it for forward transliteration as well as backward transliteration (the same might not be the case that it is necessarily an English word).
iii. The word entered is a Hindi word. obviously, if the word is not an English word, and the edit distance is not very good, it should be a Hindi word. in this case we’ll move on to next step.
3. At this point, we are considering a generative model for transliteration. We’ll use the predefined probabilities generated by the alignment process (i.e. the training model). The word generated using the top most priority graphemes is given a score. Similarly the words generated using the lower probabilities are also given a score. Now these words are checked against the monolingual Hindi dictionary we already have. This dictionary will contain all kinds of Hindi words (we’ll have to check its size for optimization). The checking process with the monolingual dictionary is done using edit distance algorithm (different implementations of edit distance algorithm are available, see below for a little explanation provided here). If the distance score is good ( i.e. less is the distance) or say above a threshold score, we then consider it.
4. Now we have two kinds of score, one is generated using the generative model and one score is generated using the edit distance between the words. the edit distance score is calibrated to the scale of generative model (this may seem a bit complex here, but after discussion we’ve found that this may be possible). Now both the scores are added to each other and a final score is produced which decides the rank of these words.
In case the edit distance score is not able to qualify for the second stage (i.e. where the two scores are added), we may consider the edit distance score to be 0.
Now those words which were left at the stage (i) and (ii) also have to be considered (English words written correctly and which were in English but written incorrect) here. This creates a complexity for us.
As a word may have a meaning in English as well as in Hindi Example
“main” — मैं or मेन
5. We can discuss the final scoring process for such words.
Now the things which are left are, context sensitiveness of the model. Context sensitiveness here refers to the context of word. It may be an English word or a Hindi word, some different contexts are also possible. If the context prefers an English word, then the English word should be preferred. (This can be discussed if we will directly give it the first priority or include it in the final scoring process and then output the result).
This is a very important case because one spelling can refer to more than English or Hindi relevant words. We assume that this work can be done statistically. We can discuss upon the feasibility of such a modeln
Training Process:
We going to use GIZA++ for the training process. Giza++ will give the character level alignments with probabilities of particular alignments. We need to have an n-gram model. The n-gram conditional probablities will be generated by multiple iterations of training set with Giza++ with everytime the training data being segmented according to n-grams in a lookup inventory.
Edit Distance Algorithm:
Giving a short introduction to the basics. This algorithm checks the distance between two words/string. The distance is equal to the number of steps took to convert one word into the the other.
In this conversion process, the possible actions are : deletion of a char, insertion of a character, substitution of one character with another.
Example: kitten vs sitting
1.kitten -> sitten (substitution)
2.sitten -> sittin (substitution)
3.sittin -> sitting (insertion)
Some variations to this algorithm are available. Provided is very basic stuff.
You’re currently reading “Algorithm : wordMint”, an entry on wordMint
- Published:
- 25.03.09 / 1am
- Category:
- Algorithm
- Tags:
- Algorithm
- Post Navigation:
- « Thanks sarai - CSDS
Transliteration Corpus for wordMint »