<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>wordMint</title>
	<atom:link href="http://www.wordmint.org/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.wordmint.org</link>
	<description>intelligent system for the smart people</description>
	<pubDate>Fri, 05 Jun 2009 11:12:00 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
	<language>en</language>
	<image>
<link>http://www.wordmint.org</link>
<url>http://74.86.66.194/~wordmint/blog/wp-content/plugins/maxblogpress-favicon/icons/favicon.ico</url>
<title>wordMint</title>
</image>
		<item>
		<title>GIZA++ : the wordMint implementation</title>
		<link>http://www.wordmint.org/2009/05/giza-the-wordmint-implementation/</link>
		<comments>http://www.wordmint.org/2009/05/giza-the-wordmint-implementation/#comments</comments>
		<pubDate>Wed, 06 May 2009 20:14:10 +0000</pubDate>
		<dc:creator>Team wordMint</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[GIZA++]]></category>

		<guid isPermaLink="false">http://www.wordmint.org/?p=16</guid>
		<description><![CDATA[GIZA++ is a statistical machine translation toolkit for training IBM Models 1-5 and an HMM word alignment model. It aligns words and sequences of words in sentence-aligned corpora. If you have a parallel corpus, you can use GIZA++ to build bilingual dictionaries.
GIZA++ is an extension of the program GIZA (part of the [...]]]></description>
			<content:encoded><![CDATA[<p>GIZA++ is a statistical machine translation toolkit for training IBM Models 1-5 and an HMM word alignment model. It aligns words and sequences of words in sentence-aligned corpora. If you have a parallel corpus, you can use GIZA++ to build bilingual dictionaries.</p>
<p>GIZA++ is an extension of the program GIZA (part of the SMT toolkit EGYPT), which was developed by the Statistical Machine Translation team during the 1999 summer workshop at the Center for Language and Speech Processing at Johns Hopkins University (CLSP/JHU). GIZA++ includes many additional features. The extensions in GIZA++ were designed and written by Franz Josef Och.</p>
<p>In wordMint, GIZA++ is used to obtain character-level alignments. GIZA++ is poorly documented, but its README file helps. Getting character-level alignments out of GIZA++ proved difficult; after exploring a workaround, we were able to align character n-grams.</p>
<p>For this, each named-entity word was treated as a sentence and the characters of the word as the words of that sentence. HMM and IBM Model 1, 3 and 4 iterations were run on the word-aligned bilingual parallel corpora. The input to GIZA++ consisted of language-pair files in the format below:<br />
श क ु न ् त ल ा [Hindi Text File] &lt; - &gt; s h a k u n t l a [English Text File]</p>
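<p>The character splitting shown above can be sketched in Python as follows. This is a minimal illustration, not our actual script: the function names and the output file names <code>hindi</code> and <code>english</code> are assumptions chosen to match the commands used later in this post.</p>

```python
def to_char_sentence(word: str) -> str:
    """Turn one word into a 'sentence' whose tokens are its characters."""
    # Iterating a Python str yields Unicode code points, so Devanagari
    # vowel signs and the virama become separate tokens.
    return " ".join(word.strip())


def write_parallel_files(pairs, hindi_path="hindi", english_path="english"):
    """Write one character-split word per line to each side's file."""
    with open(hindi_path, "w", encoding="utf-8") as hi, \
         open(english_path, "w", encoding="utf-8") as en:
        for hindi_word, english_word in pairs:
            hi.write(to_char_sentence(hindi_word) + "\n")
            en.write(to_char_sentence(english_word.lower()) + "\n")
```

<p>Line <em>k</em> of one file must correspond to line <em>k</em> of the other, since GIZA++ treats the two files as sentence-aligned.</p>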
<p>GIZA++ Package Programs :</p>
<p>• GIZA++: GIZA++ itself<br />
• plain2snt.out: simple tool to transform plain text into GIZA format<br />
• snt2plain.out: simple tool to transform GIZA format into plain text<br />
• trainGIZA++.sh: Shell script to perform standard training given a corpus in GIZA format<br />
• mkcls: Computes word classes in a monolingual corpus<br />
• snt2cooc: Generates a co-occurrence file</p>
<p>Run plain2snt.out located within the GIZA++ package to prepare text files for GIZA++ input<br />
./plain2snt.out hindi english</p>
<p>Files created by plain2snt<br />
english.vcb<br />
hindi.vcb<br />
hindienglish.snt</p>
<p>english.vcb consists of:<br />
each word from the english corpus<br />
corresponding frequency count for each word<br />
a unique id for each word</p>
<p>hindi.vcb consists of:<br />
each word from the hindi corpus<br />
corresponding frequency count for each word<br />
a unique id for each word</p>
<p>hindienglish.snt consists of:<br />
each sentence pair from the parallel Hindi and English corpora, with every word replaced by its unique id</p>
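<p>To make the three files concrete, here is a rough Python sketch of the same bookkeeping. It is not the plain2snt.out source. The starting id of 2, the first-seen ordering of ids, and the three-line layout of each sentence pair (pair count, source ids, target ids) are assumptions based on our reading of the GIZA++ output and should be checked against the README.</p>

```python
from collections import Counter


def build_vocab(sentences, first_id=2):
    """Map each token to (id, frequency); ids follow first-seen order."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    vocab = {}
    for sent in sentences:
        for tok in sent.split():
            if tok not in vocab:
                vocab[tok] = (first_id + len(vocab), counts[tok])
    return vocab


def to_snt_block(src, tgt, src_vocab, tgt_vocab):
    """Encode one sentence pair as ids: pair count, source ids, target ids."""
    src_ids = " ".join(str(src_vocab[t][0]) for t in src.split())
    tgt_ids = " ".join(str(tgt_vocab[t][0]) for t in tgt.split())
    return "1\n" + src_ids + "\n" + tgt_ids


english = ["s h a k u n t l a"]
hindi = ["श क ु न ् त ल ा"]
en_vocab = build_vocab(english)
hi_vocab = build_vocab(hindi)
print(en_vocab["a"])  # (id, frequency) of the token "a"
print(to_snt_block(hindi[0], english[0], hi_vocab, en_vocab))
```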
<p>Now, run GIZA++ located within the GIZA++ package<br />
./GIZA++ -S hindi.vcb -T english.vcb -C hindienglish.snt</p>
<p>GIZA++ produces several files (distortion probability model, word alignment probability model, etc.), but only one of them matters to us: the file with the extension *.A3.final. It holds the alignments from the final training iteration: for each line of the input parallel corpus, in the same order, it gives the final word-to-word alignment of every word.</p>
<p>Python scripts were written to parse the base corpora and convert the text into the format required for the input files.</p>
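<p>A small parser for this file might look like the sketch below. It assumes each sentence pair takes three lines (a comment header, the target sentence, and the source tokens each followed by the target positions aligned to them in <code>({ ... })</code>). The sample record is invented for illustration and is not real GIZA++ output.</p>

```python
import re

# One "token ({ i j ... })" group on the alignment line.
_GROUP = re.compile(r"(\S+) \(\{([^}]*)\}\)")


def parse_a3(lines):
    """Yield (target_tokens, [(source_token, [target_positions])]) per pair."""
    lines = [ln.rstrip("\n") for ln in lines]
    # Records are assumed to be three lines: header, target, alignment.
    for i in range(0, len(lines) - 2, 3):
        target = lines[i + 1].split()
        aligned = [(tok, [int(n) for n in idx.split()])
                   for tok, idx in _GROUP.findall(lines[i + 2])]
        yield target, aligned


# Invented sample record, for illustration only.
sample = [
    "# Sentence pair (1) source length 3 target length 3 alignment score : 0.01",
    "r a m",
    "NULL ({ }) र ({ 1 }) ा ({ 2 }) म ({ 3 })",
]

for target, aligned in parse_a3(sample):
    for source_token, positions in aligned:
        # Positions are 1-based indexes into the target line.
        print(source_token, [target[p - 1] for p in positions])
```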
]]></content:encoded>
			<wfw:commentRss>http://www.wordmint.org/2009/05/giza-the-wordmint-implementation/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Transliteration Corpus for wordMint</title>
		<link>http://www.wordmint.org/2009/05/transliteration-corpus-for-wordmint/</link>
		<comments>http://www.wordmint.org/2009/05/transliteration-corpus-for-wordmint/#comments</comments>
		<pubDate>Thu, 30 Apr 2009 19:52:08 +0000</pubDate>
		<dc:creator>Team wordMint</dc:creator>
		
		<category><![CDATA[News]]></category>

		<category><![CDATA[Updates]]></category>

		<category><![CDATA[corpus]]></category>

		<guid isPermaLink="false">http://www.wordmint.org/?p=15</guid>
		<description><![CDATA[We have been working for some time on preparing a training corpus, and we now have a good-quality, sentence-aligned corpus for English-to-Hindi back-transliteration. The corpus is licensed under the Creative Commons Attribution-ShareAlike 2.5 India License. So you can use/modify/distribute the corpus for any purpose as long [...]]]></description>
			<content:encoded><![CDATA[<p>We have been working for some time on preparing a training corpus, and we now have a good-quality, sentence-aligned corpus for English-to-Hindi back-transliteration. The corpus is licensed under the Creative Commons Attribution-ShareAlike 2.5 India License, so you can use, modify and distribute it for any purpose as long as you attribute the work to the wordMint team and keep the freedoms intact.</p>
<p>The corpus is a collection of about 100 songs written in romanized Hindi, with parallel Hindi text in Devanagari.</p>
<p>Click on the download link below to download the complete corpus.</p>
<p><a title="download the corpus" href="http://www.wordmint.org/blog/downloads/transliteration-corpus.tar.gz">Download</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.wordmint.org/2009/05/transliteration-corpus-for-wordmint/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Algorithm : wordMint</title>
		<link>http://www.wordmint.org/2009/03/algorithm-wordmint/</link>
		<comments>http://www.wordmint.org/2009/03/algorithm-wordmint/#comments</comments>
		<pubDate>Tue, 24 Mar 2009 19:47:48 +0000</pubDate>
		<dc:creator>Team wordMint</dc:creator>
		
		<category><![CDATA[Algorithm]]></category>

		<guid isPermaLink="false">http://www.wordmint.org/?p=14</guid>
		<description><![CDATA[Below is the algorithm for the wordMint project. The approach is flexible, and we can accommodate changes in the future.
Algorithm
1. The user inputs the string to be processed/transliterated. This will be in roman script.
2. Now there are a few possibilities:
i. The word entered is in English. This is an example of forward transliteration. We put the [...]]]></description>
			<content:encoded><![CDATA[<p>Below is the algorithm for the wordMint project. The approach is flexible, and we can accommodate changes in the future.</p>
<p><strong>Algorithm</strong></p>
<p>1. The user inputs the string to be processed/transliterated. This will be in roman script.</p>
<p>2. Now there are a few possibilities:</p>
<p>i. The word entered is in English. This is a case of forward transliteration. We pass the word through a language identification model. (We use this term in a broad sense, since the algorithm could work with multiple languages. In our case the languages are primarily Hindi and English, so a dictionary check for English settles the question; more advanced methods exist for multiple scripts or languages.) If the word is in English, CMUdict will be used; it lists English words with their pronunciations in the ARPAbet phoneme set, whose symbols correspond to IPA. We can map these phonemes to Devanagari characters, which should not be a difficult task. These words are then also considered for backward transliteration; move on to step 3 for that.</p>
<p>ii. The word entered could be in English, but the spelling may be incorrect. For example: &#8220;transliteration&#8221; written as &#8220;traanslitareshan&#8221;<br />
Here we can use a modified edit distance to check whether the entered word is close to some English word. If the score is good enough, we consider the word for forward as well as backward transliteration (since it is not necessarily an English word).</p>
<p>iii. The word entered is a Hindi word. If the word is not an English word and the edit distance score is not good either, we take it to be a Hindi word. In this case we&#8217;ll move on to the next step.</p>
<p>3. At this point, we consider a generative model for transliteration. We&#8217;ll use the predefined probabilities generated by the alignment process (i.e. the trained model). The word generated from the highest-probability graphemes is given a score, and the words generated from lower-probability graphemes are scored in the same way. These words are then checked against the monolingual Hindi dictionary we already have. This dictionary will contain all kinds of Hindi words (we&#8217;ll have to check its size for optimization). The check against the monolingual dictionary uses an edit distance algorithm (several implementations exist; a short explanation is given at the end of this post). If the distance is small enough, i.e. the match clears a threshold, we consider the word.</p>
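<p>As a rough sketch of the generative step, candidates can be enumerated by combining the target graphemes proposed for each source segment and multiplying their probabilities. The probability table below is hypothetical and invented for illustration; the real values would come from the alignment training. Scoring by a plain product is also an assumption.</p>

```python
from itertools import product
from math import prod


def generate_candidates(segments, table, top_n=5):
    """Combine per-segment options; score each word by the product of probabilities."""
    options = [table[seg] for seg in segments]
    scored = []
    for combo in product(*options):
        word = "".join(grapheme for grapheme, _ in combo)
        score = prod(p for _, p in combo)
        scored.append((word, score))
    scored.sort(key=lambda ws: ws[1], reverse=True)
    return scored[:top_n]


# Hypothetical probabilities, invented for illustration only.
TABLE = {
    "m": [("म", 0.9), ("म्", 0.1)],
    "ai": [("ै", 0.6), ("े", 0.4)],
    "n": [("ं", 0.6), ("न", 0.4)],
}

for word, score in generate_candidates(["m", "ai", "n"], TABLE):
    print(word, round(score, 3))
```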
<p>4. Now we have two kinds of score: one from the generative model and one from the edit distance between the words. The edit distance score is calibrated to the scale of the generative model (this may seem a bit complex, but after discussion we&#8217;ve found that it should be possible). The two scores are then added to produce a final score, which decides the rank of these words.<br />
If the edit distance score does not qualify for the second stage (i.e. where the two scores are added), we may take the edit distance score to be 0.<br />
The words set aside at stages (i) and (ii) (English words written correctly, and English words written incorrectly) also have to be considered here. This adds complexity,<br />
as a word may have a meaning in English as well as in Hindi. Example:<br />
&#8220;main&#8221; &#8212; मैं or मेन</p>
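<p>One possible way to combine the two scores is sketched below. The calibration is an assumption, not a settled design: the edit distance is normalized by word length into a similarity between 0 and 1, so that it sits on the same scale as a generative probability, and it counts as 0 when it falls below a threshold.</p>

```python
def combined_score(gen_score, distance, length, threshold=0.5):
    """Add a dictionary-match bonus in [0, 1] to the generative score."""
    # Normalize the edit distance so 1.0 means an exact dictionary match.
    similarity = 1.0 - min(distance, length) / max(length, 1)
    # Below the threshold the dictionary contributes nothing (score 0).
    bonus = similarity if similarity >= threshold else 0.0
    return gen_score + bonus


def rank(candidates):
    """candidates: (word, gen_score, distance_to_nearest_dictionary_word)."""
    scored = [(w, combined_score(g, d, len(w))) for w, g, d in candidates]
    return sorted(scored, key=lambda ws: ws[1], reverse=True)
```

<p>With this scheme a candidate that matches a dictionary word exactly can outrank one with a higher generative score but no close dictionary match.</p>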
<p>5. We can discuss the final scoring process for such words.<br />
What remains is the context sensitivity of the model, that is, sensitivity to the context of the word. The word may be English or Hindi, and other contexts are also possible. If the context favours an English word, then the English word should be preferred. (We can discuss whether to give it first priority directly or to include it in the final scoring process before producing the output.)<br />
This is a very important case because one spelling can correspond to more than one relevant English or Hindi word. We assume that this work can be done statistically. We can discuss the feasibility of such a model.</p>
<p><strong>Training Process:</strong></p>
<p>We are going to use GIZA++ for the training process. GIZA++ will give the character-level alignments along with the probabilities of particular alignments. We need an n-gram model. The n-gram conditional probabilities will be generated by running GIZA++ over the training set multiple times, each time with the training data segmented according to the n-grams in a lookup inventory.</p>
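<p>The segmentation step could look something like the sketch below. Greedy longest-match against the inventory is our assumption here; the inventory contents and the <code>max_n</code> limit are placeholders.</p>

```python
def segment(word, inventory, max_n=3):
    """Greedy longest-match split of a word into n-grams from the inventory."""
    units = []
    i = 0
    while i != len(word):
        # Try the longest chunk first, then shorter ones.
        for n in range(min(max_n, len(word) - i), 0, -1):
            chunk = word[i:i + n]
            # Single characters are always accepted as a fallback.
            if n == 1 or chunk in inventory:
                units.append(chunk)
                i += n
                break
    return units


# Placeholder inventory, for illustration only.
print(" ".join(segment("shakuntla", {"sh", "nt"})))
```

<p>Each training pass would write the segmented words, one per line and space-separated, as the input corpus for the next GIZA++ run.</p>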
<p><strong>Edit Distance Algorithm:</strong></p>
<p>A short introduction to the basics: this algorithm measures the distance between two words (strings). The distance equals the minimum number of steps needed to convert one word into the other.<br />
In this conversion, the possible actions are deletion of a character, insertion of a character, and substitution of one character with another.</p>
<p>Example: kitten vs sitting<br />
1. kitten -&gt; sitten (substitution)<br />
2. sitten -&gt; sittin (substitution)<br />
3. sittin -&gt; sitting (insertion)</p>
<p>Variations of this algorithm exist; the above covers only the basics.</p>
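<p>For reference, the basic version (Levenshtein distance with unit costs) can be written in a few lines of Python using dynamic programming. This is the textbook form, not the modified variant mentioned in step 2.</p>

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3, as in the example above
```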
]]></content:encoded>
			<wfw:commentRss>http://www.wordmint.org/2009/03/algorithm-wordmint/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Thanks sarai - CSDS</title>
		<link>http://www.wordmint.org/2008/11/thanks-sarai-csds/</link>
		<comments>http://www.wordmint.org/2008/11/thanks-sarai-csds/#comments</comments>
		<pubDate>Wed, 19 Nov 2008 04:37:05 +0000</pubDate>
		<dc:creator>criss</dc:creator>
		
		<category><![CDATA[News]]></category>

		<category><![CDATA[Updates]]></category>

		<category><![CDATA[CSDS]]></category>

		<category><![CDATA[Fellowship]]></category>

		<category><![CDATA[FLOSS]]></category>

		<category><![CDATA[sarai]]></category>

		<guid isPermaLink="false">http://www.wordmint.org/?p=12</guid>
		<description><![CDATA[Thanks sarai - CSDS
Congratulations, wM family. SARAI has decided to support the wordMint project. wordMint has been selected for The Sarai - CSDS Independent FLOSS Fellowship 2008.
This will accelerate wordMint. We hope that with this fellowship, the first version of wordMint will be released within six months.
Thanks SARAI - CSDS for showing confidence [...]]]></description>
			<content:encoded><![CDATA[<p>Thanks sarai - CSDS</p>
<p>Congratulations, wM family. SARAI has decided to support the wordMint project. wordMint has been selected for <a href="http://www.sarai.net/fellowships/floss">The Sarai - CSDS Independent FLOSS Fellowship 2008</a>.</p>
<p>This will accelerate wordMint. We hope that with this fellowship, the first version of wordMint will be released within six months.</p>
<p>Thanks <a href="http://www.sarai.net">SARAI</a> - <a href="http://www.csds.in/">CSDS</a> for showing confidence in us.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wordmint.org/2008/11/thanks-sarai-csds/feed/</wfw:commentRss>
		</item>
		<item>
		<title>wM Algorithm</title>
		<link>http://www.wordmint.org/2008/08/wm-algorithm/</link>
		<comments>http://www.wordmint.org/2008/08/wm-algorithm/#comments</comments>
		<pubDate>Fri, 01 Aug 2008 12:45:19 +0000</pubDate>
		<dc:creator>criss</dc:creator>
		
		<category><![CDATA[Algorithm]]></category>

		<category><![CDATA[Developement]]></category>

		<category><![CDATA[Updates]]></category>

		<category><![CDATA[Statistical]]></category>

		<category><![CDATA[WFSM]]></category>

		<guid isPermaLink="false">http://www.wordmint.org/?p=11</guid>
		<description><![CDATA[We are discussing wM Algorithm on our wiki these days.
Kapil has suggested using a statistical algorithm. Statistical algorithms have the advantage of language-neutral design. He said that this will result in a simpler design for wM.
Recently Criss and Jaiz came up with a new idea. It&#8217;s kind of modified statistical and correspondence based weighted finite [...]]]></description>
			<content:encoded><![CDATA[<p>We are discussing wM Algorithm on our wiki these days.</p>
<p>Kapil has suggested using a statistical algorithm. Statistical algorithms have the advantage of language-neutral design. He said that this will result in a simpler design for wM.</p>
<p>Recently Criss and Jaiz came up with a new idea. It&#8217;s a kind of modified statistical, correspondence-based weighted finite state machine. <img src='http://www.wordmint.org/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.wordmint.org/2008/08/wm-algorithm/feed/</wfw:commentRss>
		</item>
		<item>
		<title>wM is married to Python</title>
		<link>http://www.wordmint.org/2008/08/wm-is-married-to-python/</link>
		<comments>http://www.wordmint.org/2008/08/wm-is-married-to-python/#comments</comments>
		<pubDate>Fri, 01 Aug 2008 07:43:24 +0000</pubDate>
		<dc:creator>criss</dc:creator>
		
		<category><![CDATA[Developement]]></category>

		<category><![CDATA[Updates]]></category>

		<category><![CDATA[Developement Langauge]]></category>

		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://www.wordmint.org/?p=10</guid>
		<description><![CDATA[We have good news. wordMint is married to Python.
Recently Jaiz proposed that we should develop the wM project in Python. He said the specific advantages of Python are that it is interpreted and platform independent; it is also a language of the open source world.
Java and C++ were close competitor but Python was beautiful enough to get [...]]]></description>
			<content:encoded><![CDATA[<p>We have good news. wordMint is married to Python.</p>
<p>Recently Jaiz proposed that we should develop the wM project in Python. He said the specific advantages of Python are that it is interpreted and platform independent; it is also a language of the open source world.</p>
<p>Java and C++ were close competitors, but Python was beautiful enough to get them rejected. Java is not yet standardized and lacks extensibility. C++ is not exactly rejected, as Python can be extended with any C or C++ module (for more information click <a title="navigate to python documentation site" href="http://www.python.org/doc/ext/intro.html">here</a>).</p>
<p>Python can be readily integrated with CGI, which would make the same tool available online and offline (for more information click <a title="navigate to python documentation site" href="http://docs.python.org/lib/module-cgi.html">here</a> and <a title="computer science virginia university" href="http://www.cs.virginia.edu/~lab2q/">here</a>). Java&#8217;s EJB could also have been an option, but it requires J2EE servers.</p>
<p>That said, forks that develop modules in other languages remain an open option. The official language of wM is Python.</p>
<p>[You can find more comparisons <a href="http://www.ferg.org/projects/python_java_side-by-side.html">here</a>]</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wordmint.org/2008/08/wm-is-married-to-python/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Developers meeting</title>
		<link>http://www.wordmint.org/2008/08/devlopers-meeting/</link>
		<comments>http://www.wordmint.org/2008/08/devlopers-meeting/#comments</comments>
		<pubDate>Fri, 01 Aug 2008 07:00:22 +0000</pubDate>
		<dc:creator>criss</dc:creator>
		
		<category><![CDATA[Updates]]></category>

		<category><![CDATA[developers]]></category>

		<category><![CDATA[Meeting]]></category>

		<guid isPermaLink="false">http://www.wordmint.org/?p=9</guid>
		<description><![CDATA[We recently had a meeting with all the developers except CP. The agenda was a discussion of the project. We have decided that wordMint will be available in two forms: online and as offline software.
We have already selected Python as the development language. Criss is undertaking the desktop interface and CP is undertaking the web interface of [...]]]></description>
			<content:encoded><![CDATA[<p>We recently had a meeting with all the developers except CP. The agenda was a discussion of the project. We have decided that wordMint will be available in two forms: online and as offline software.</p>
<p>We have already selected Python as the development language. Criss is undertaking the desktop interface and CP is undertaking the web interface of wM. Kapil and Jaiz are working on the mighty algorithms. Criss will now be helping Kapil and Jaiz because of the heavy workload on the algorithm side, as there is nothing for him to do on the desktop interface at the moment.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wordmint.org/2008/08/devlopers-meeting/feed/</wfw:commentRss>
		</item>
		<item>
		<title>wordMint baby born!</title>
		<link>http://www.wordmint.org/2008/06/hello-world-2/</link>
		<comments>http://www.wordmint.org/2008/06/hello-world-2/#comments</comments>
		<pubDate>Sun, 08 Jun 2008 17:56:36 +0000</pubDate>
		<dc:creator>Team wordMint</dc:creator>
		
		<category><![CDATA[General]]></category>

		<category><![CDATA[contact]]></category>

		<category><![CDATA[developers]]></category>

		<category><![CDATA[google]]></category>

		<category><![CDATA[mapping]]></category>

		<category><![CDATA[open source]]></category>

		<category><![CDATA[traditional model]]></category>

		<category><![CDATA[transliteration system]]></category>

		<category><![CDATA[wordMint introduction]]></category>

		<guid isPermaLink="false">http://www.wordmint.org/blog/?p=1</guid>
		<description><![CDATA[Welcome to the wordMint project blog. Our work has finally started. Some people would treat setting up a website for the project as a secondary task, but we did it first because it gives the project a way to interact with the outside world. So happy to see [...]]]></description>
			<content:encoded><![CDATA[<p>Welcome to the wordMint project blog. Our work has finally started. Some people would treat setting up a website for the project as a secondary task, but we did it first because it gives the project a way to interact with the outside world. We are happy to see the work going on.</p>
<p>Thanks very much to Nalin Makar for the great looking theme <a title="HemingwayEx theme by Nalin Maker" href="http://www.nalinmakar.com/hemingwayex/" target="_blank">HemingwayEx</a>, though we modified it according to our logo and needs with a bit of CSS.</p>
<p>Now something about the project :</p>
<p>The project&#8217;s aim is to develop a free and open source intelligent transliteration system, mainly for English-to-Hindi back-transliteration. We&#8217;ll try to give it equal offline capabilities. To that end, statistical machine learning algorithms from artificial intelligence are used, so that we can accurately predict what a person intended to write in Hindi.</p>
<p>In contrast with the traditional letter-to-letter mapping (the grapheme model), the power of such a system lies in its intelligence. In the traditional system a person has to remember the character map (which may or may not be phonetic), which makes the work more complex.</p>
<p>In the traditional model, we use character-to-character mapping (more technically, grapheme-to-grapheme mapping) such as</p>
<p>t &#8211;&gt; ट्<br />
th &#8211;&gt; त्<br />
T &#8211;&gt; ठ्</p>
<p>So whenever these appear in the source, the target grapheme is generated from the mapping, irrespective of context or correctness.</p>
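<p>A fixed mapping of this kind amounts to a lookup table. The sketch below uses only the three entries listed above (a real scheme has many more) and tries longer keys first, which is an assumption about how such tools resolve overlaps like "t" and "th".</p>

```python
import re

# The three mappings listed above; a real scheme has many more.
MAPPING = {"t": "ट्", "th": "त्", "T": "ठ्"}

# Longer keys first so "th" is not read as "t" followed by "h".
_PATTERN = re.compile("|".join(
    re.escape(k) for k in sorted(MAPPING, key=len, reverse=True)))


def map_graphemes(text):
    """Replace every known source grapheme; leave other characters as they are."""
    return _PATTERN.sub(lambda m: MAPPING[m.group(0)], text)


print(map_graphemes("th"), map_graphemes("t"), map_graphemes("T"))
```

<p>The output depends only on the typed characters, which is exactly why the user has to remember the table.</p>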
<p>In our proposed method, this work will be done intelligently. For example, writing either &#8216;namaste&#8217; or &#8216;nmaste&#8217; would produce &#8216;नमस्ते&#8217;, removing the need to remember a complex mapping. Knowledge of the English keyboard and of the pronunciation of the different graphemes is enough.</p>
<p>Different implementations are available, <a title="Google Indic Transliteration" href="http://www.google.com/transliterate/indic" target="_blank">google indic transliteration</a> being very popular, but none of them is open source and none offers an offline mode. Our target is to make this system as efficient as possible.</p>
<p>This idea came out of the <a title="Workshop on Hindi/Urdu/Kashmiri Localization" href="http://www.sarai.net/about-us/events/workshops/workshop-on-hindi-urdu-kashmiri-localisation/workshop-on-hindi-urdu-kashmiri-localisation/?searchterm=None">workshop on Hindi/Urdu/Kashmiri Localization </a>organized at <a title="Website of Sarai" href="http://www.sarai.net/">Sarai</a>.</p>
<p>We have started our work and hope to have the first build in two months. Click on the <a title="Developers contact information" href="http://www.wordmint.org/contact-us/">Contact Us</a> tab to contact us. You can send us your suggestions or ideas by email, using the address given on the <a title="Developers contact information" href="http://www.wordmint.org/contact-us/">Contact Us</a> page.</p>
<p>So happy <strong>minting</strong>!!!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wordmint.org/2008/06/hello-world-2/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>

