Thursday, January 06, 2011

Natural Language Processing Tools for Chichewa

Natural language processing (NLP) is a discipline of computer science that focuses on developing artificial intelligence systems able to interact with human beings in their natural languages. Such systems try to understand the patterns of human languages and process the given data (text or speech) accordingly. These tools are useful in a wide range of applications. For example, some feature phones and smartphones can read out the name of the caller as the phone rings. Other applications, such as word processors, can read a large document and generate a summary of it. There are also applications (e.g. Google Translate) that read text in one human language and translate it into another. All of these are products of the field of NLP.

I have been working on several projects for Chichewa, a lingua franca of Malawi (formerly its national language), Zambia and some parts of Mozambique and Zimbabwe. These systems are still being refined, but they can already do quite a lot at this stage. I am confident that by the end of these projects Chichewa will have a place on the NLP map. I would like to share my experiences with you.

ChicMorph: A Morphological Analyzer for Chichewa Verbs
    Chichewa is agglutinative in nature: a single word can combine several sub-words (technically called morphemes). For example, sindibweranso (Lit: I am not coming again) can be broken down as follows: si(not)-ndi(I)-bwer(come)-a-nso(again). Notice that the "a" has no literal meaning. It is just a final vowel that completes bwer, the stem of that verb.

    ChicMorph takes raw Chichewa verbs and discovers and isolates their constituent morphemes. Some Chichewa verbs are tricky in that their roots contain substrings that are also morphemes in their own right. For example, er is an applicative morpheme in gwera (gw-er-a), but it is not a morpheme in bwera (so bw-er-a is incorrect and bwer-a is correct). So far I have improved ChicMorph so that it correctly analyses verbs whose roots contain strings that also occur as prefixal or suffixal allomorphs, like these ones.
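    The root ambiguity described above can be sketched in Python. This is only an illustration under stated assumptions, not ChicMorph's actual algorithm: the tiny inventories, the longest-root-first strategy and the function names `segment` and `split_extensions` are all hypothetical.

```python
# Toy inventories, assumed for illustration only; a real lexicon is far larger.
PREFIXES = ["si", "ndi"]   # negation, first person singular subject
ROOTS = ["bwer", "gw"]     # roots of bwera and gwera
EXTENSIONS = ["er"]        # applicative
FINAL_VOWEL = "a"
CLITICS = ["nso"]          # 'again'


def split_extensions(tail):
    """Split the material after the root into known extensions, or return None."""
    found = []
    while tail:
        for ext in EXTENSIONS:
            if tail.startswith(ext):
                found.append(ext)
                tail = tail[len(ext):]
                break
        else:
            return None  # leftover material is not a known extension
    return found


def segment(verb):
    """Return the list of morphemes in `verb`, or None if no analysis is found."""
    rest, prefixes, clitics = verb, [], []
    # Peel known prefixes off the left edge.
    stripped = True
    while stripped:
        stripped = False
        for prefix in PREFIXES:
            if rest.startswith(prefix):
                prefixes.append(prefix)
                rest = rest[len(prefix):]
                stripped = True
                break
    # Peel known clitics off the right edge.
    for clitic in CLITICS:
        if rest.endswith(clitic):
            clitics.insert(0, clitic)
            rest = rest[:-len(clitic)]
    if not rest.endswith(FINAL_VOWEL):
        return None
    stem = rest[:-len(FINAL_VOWEL)]
    # Try the longest root first, so that bwer wins over a false bw + er split.
    for root in sorted(ROOTS, key=len, reverse=True):
        if stem.startswith(root):
            extensions = split_extensions(stem[len(root):])
            if extensions is not None:
                return prefixes + [root] + extensions + [FINAL_VOWEL] + clitics
    return None


print(segment("sindibweranso"))
print(segment("bwera"))
print(segment("gwera"))
```

    Because the root list contains bwer but no bw, bwera can only be analysed as bwer-a, whereas gwera is analysed as gw-er-a.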
     ChicPOS: Part of Speech Tagger
      I have been working on a Chichewa part-of-speech tagger since August 2010, and it is doing well. I hope to make more breakthroughs in due course. Right now, ChicPOS handles all Chichewa parts of speech, including punctuation:
      •  Mwana womaliza uja wa a Phiri wabwera kudzagula mchere. (Lit: That last born child to Mr. Phiri has come to buy salt.) => Mwana[NN] womaliza[JJ] uja[DEM] wa[IN] a[HON] Phiri[NNP] wabwera[VB] kudzagula[VB] mchere[NN] .[.]
      Key: DEM => Demonstrative Adjective, HON => Honorific a, IN => Preposition, JJ => Adjective, NN => Noun, NNP => Proper Noun, POSS => Possessive Adjective, PN => Pronoun, PRP => Personal Pronoun, VB => Verb.
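      For anyone who wants to process output in this word[TAG] notation, a minimal Python sketch follows. The regular expression and the function name `parse_tagged` are assumptions made for this example; they are not part of ChicPOS.

```python
import re

# One token is a run of non-space characters followed by its tag in square brackets.
TOKEN = re.compile(r"(\S+)\[([^\]]+)\]")


def parse_tagged(text):
    """Turn 'Mwana[NN] womaliza[JJ] .[.]' into a list of (word, tag) pairs."""
    return TOKEN.findall(text)


sentence = ("Mwana[NN] womaliza[JJ] uja[DEM] wa[IN] a[HON] Phiri[NNP] "
            "wabwera[VB] kudzagula[VB] mchere[NN] .[.]")
for word, tag in parse_tagged(sentence):
    print(word, tag)
```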

      ChicPOS is also able to identify proper nouns within a given phrase. Compare usage of "Talandira" in the following phrases:
      •  Talandira ndalama kuchokera kwa a Chikale. (Lit: We have received money from Mr. Chikale.) => Talandira[VB] ndalama[NN] kuchokera[VB] kwa[ASSOC] a[HON] Chikale[NNP] .[.]
      • Ndamuona Talandira akudutsa apa. (Lit: I have seen Talandira passing by here.) => Ndamuona[VB] Talandira[NNP] akudutsa[VB] apa[DEM] .[.]
      ChicPOS fails to identify proper nouns in some positions, especially when they begin a sentence, as in Talandira akudutsa apa. (Lit: Talandira is passing by here.) => Talandira[VB] akudutsa[VB] apa[DEM] .[.] (compare with: Akudutsa apa Talandira. (Lit: He/She is passing by here, Talandira.) => Akudutsa[VB] apa[DEM] Talandira[NNP] .[.]). Proper nouns are tricky even in everyday conversation because of the way names are formed in Chichewa. Some proper names originate from verbs or verb phrases (e.g. Talandira (Lit: We have received), Kalinda-kadye (Lit: It waits to eat)), while others come from common nouns (Chipiriro (Patience), Ulemu (Politeness/Respect)). Notice that ChicPOS is, in a sense, also correct in this special case: Talandira akudutsa apa. => Talandira[VB] akudutsa[VB] apa[DEM] .[.] Because Talandira originates from a verb, simply changing the tone makes the phrase Talandira akudutsa apa. mean "We have received (something) while he was passing here." In short, I am still exploring how to handle proper nouns.
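      One simple heuristic that reproduces the behaviour described above is to treat a capitalised word as a proper noun only when it is not the first word of the sentence. The Python sketch below illustrates that idea; it is not a description of how ChicPOS works internally. The mini-lexicon, the NN fallback for unknown words and the function name `tag` are all hypothetical.

```python
# Hypothetical mini-lexicon: lower-cased word form -> tag.
LEXICON = {
    "talandira": "VB",
    "ndamuona": "VB",
    "akudutsa": "VB",
    "apa": "DEM",
    ".": ".",
}


def tag(tokens):
    """Tag tokens, guessing NNP for capitalised words that are not sentence-initial."""
    tagged = []
    for position, token in enumerate(tokens):
        if position > 0 and token[0].isupper():
            # Capitalised in the middle of a sentence: assume a proper noun.
            tagged.append((token, "NNP"))
        else:
            # Sentence-initial capitals are uninformative, so fall back to the lexicon.
            # Unknown words default to NN (an assumption made for this sketch).
            tagged.append((token, LEXICON.get(token.lower(), "NN")))
    return tagged


print(tag(["Ndamuona", "Talandira", "akudutsa", "apa", "."]))
print(tag(["Talandira", "akudutsa", "apa", "."]))
```

      The second call tags Talandira as VB, which is exactly the sentence-initial ambiguity discussed above: capitalisation alone cannot tell the name from the verb in that position.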

      Right now, I have six thousand Chichewa words (thanks to Prof. Kevin Scannell for compiling the initial wordlist using his An Crúbadán crawler). I am in the process of tagging them, and I will be adding more words. A note on tags: I have tried to preserve popular tags like NN and JJ, but for words where I could not find an existing tag I gave priority to the abbreviations used in The Syntax of Chichewa by Prof. Sam Mchombo. Otherwise, I created my own. I hope to create a standardized tagset for Chichewa (and eventually for other Malawian languages). I am also looking at similar work on Swahili and Nguni languages.

      AffixGen: Chichewa Verb Generator
        Along the way, I also developed a Chichewa verb generator. It automatically generates 66,082 prefixes and 2,870 suffixes (using the CARP [Causative-Applicative-Reciprocal-Passive] and RCAP suffix orderings; in RCAP the reciprocal precedes the other suffixes, as in menyanitsa). The suffix extension can currently take up to three clitics. For each verb root, it generates 66,082 x 2,870 = 189,655,340 possible verb forms. This means that with just 10 Chichewa verb roots you can generate close to 2 billion Chichewa verbs! Of course, some of them may not make sense because of the semantics of the root (compare menya and bwera => akuzimenyanitsa vs akuzibweranitsa). I am still working on this. I would like to collect all verb roots (ha ha, if I can manage!) and classify them so that such odd combinations no longer occur, or at least so that the error rate drops drastically. Right now I have 500 verb roots, and the system can generate 94,827,670,000 (about 94.8 billion) verbs!
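        The arithmetic above is a plain cross product of prefixes, roots and suffixes, which a few lines of Python can check. This is a sketch only: the two counts come from the text, while the function names and the tiny prefix and suffix lists in the example are assumptions, not AffixGen's real inventories.

```python
from itertools import product

N_PREFIXES = 66082  # prefix count quoted in the text
N_SUFFIXES = 2870   # suffix count quoted in the text


def forms_per_root():
    """Every prefix can combine with every suffix."""
    return N_PREFIXES * N_SUFFIXES


def total_forms(n_roots):
    """Total candidate verb forms for a given number of roots."""
    return n_roots * forms_per_root()


def generate(prefixes, root, suffixes):
    """Yield prefix + root + suffix for every combination (hypothetical helper)."""
    for prefix, suffix in product(prefixes, suffixes):
        yield prefix + root + suffix


print(forms_per_root())
print(total_forms(10))
print(total_forms(500))
print(list(generate(["ndi", "sindi"], "bwer", ["a", "anso"])))
```

        Generating lazily with a generator matters here: at roughly 190 million forms per root, holding every form in memory at once quickly becomes impractical.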

        I am using AffixGen output to build spell-checking plugins based on Hunspell, and so far I have created two, one for Firefox and another for OpenOffice (available on the OpenOffice.org website, although the online version is not up to date yet).
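        A Hunspell dictionary consists of a .dic word list and an .aff file of affix rules, so generated affixes can be stored as rules rather than as billions of full word forms. The Python sketch below writes a tiny pair of such files. It only illustrates the file format and does not show how the actual plugins are built: the flag names P and S, the function names and the sample entries are hypothetical.

```python
def build_aff(prefixes, suffixes):
    """Return the text of a minimal Hunspell .aff file.

    Each rule line reads: TYPE FLAG STRIP AFFIX CONDITION. A strip value
    of 0 removes nothing, and the condition '.' matches any stem.
    """
    lines = ["SET UTF-8"]
    # Header: type, flag, cross-product allowed (Y), number of rules.
    lines.append(f"PFX P Y {len(prefixes)}")
    lines += [f"PFX P 0 {prefix} ." for prefix in prefixes]
    lines.append(f"SFX S Y {len(suffixes)}")
    lines += [f"SFX S 0 {suffix} ." for suffix in suffixes]
    return "\n".join(lines) + "\n"


def build_dic(stems):
    """Return the text of a .dic file: an entry count, then stem/FLAGS lines."""
    lines = [str(len(stems))]
    lines += [f"{stem}/PS" for stem in stems]
    return "\n".join(lines) + "\n"


print(build_aff(["ndi", "sindi"], ["nso"]))
print(build_dic(["bwera"]))
```

        With these two files Hunspell would accept bwera, ndibwera, bweranso, sindibweranso and so on. The stem is stored with its final vowel so that the bare dictionary entry is itself a valid word.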

        ChiVisualize: Dynamic Visualization Tool of Chichewa Phrase Structures
          In line with ChicPOS, I am creating a visualization tool for Chichewa phrase structures. This is another exciting piece of work that I have ventured into. ChiVisualize takes tagged phrases and builds a syntax tree, as in the following example:
          Mkango[NN] uja[DEM] ukuba[VB] mikanda[NN] yanu[POSS] (That lion is stealing your beads.)
          The syntax tree is interactive and dynamic. You can change its orientation in four directions: top, bottom, left and right. You can also emphasize a particular level in any of the sub-trees. The system is able to "virtually" simulate all six phrase structures for Chichewa: SVO, SOV, VSO, VOS, OVS and OSV. ChiVisualize uses the JavaScript InfoVis Toolkit to create these interactive visualizations.

          ChiVisualize is still in its formative stages. Right now, the tagged text is processed into a base phrase structure (as defined by Chomsky's Minimalist Theory/Program) manually and given to ChiVisualize for syntax tree generation. Currently, I am working on an algorithm that will be able to automatically generate a Base Phrase Structure for given tagged text.
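          To make the manual step concrete, here is a minimal Python sketch that turns a hand-built bracketing into the nested id/name/data/children JSON that the JavaScript InfoVis Toolkit accepts for trees. The bracketing of the example sentence, the node id scheme and the function name `to_tree` are assumptions made for illustration; they are not ChiVisualize's actual data or code.

```python
import json
from itertools import count

# Hand-built structure for: Mkango[NN] uja[DEM] ukuba[VB] mikanda[NN] yanu[POSS]
# Each node is (label, child, child, ...); a plain string is a leaf (the word).
# This particular bracketing is an assumption made for the example.
PHRASE = (
    "S",
    ("NP", ("NN", "Mkango"), ("DEM", "uja")),
    ("VP",
     ("VB", "ukuba"),
     ("NP", ("NN", "mikanda"), ("POSS", "yanu"))),
)


def to_tree(node, ids=None):
    """Convert a nested tuple into a dict with id, name, data and children keys."""
    if ids is None:
        ids = count(1)        # shared counter so that every node id is unique
    if isinstance(node, str):
        node = (node,)        # a bare word becomes a childless node
    label, *children = node
    return {
        "id": f"node{next(ids)}",
        "name": label,
        "data": {},
        "children": [to_tree(child, ids) for child in children],
    }


print(json.dumps(to_tree(PHRASE), indent=2))
```

          The printed JSON could then be handed to the toolkit on the JavaScript side; generating the nested tuple automatically from tagged text is the part that remains to be solved.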

          Later on, I will combine ChicPOS, ChicMorph and ChiVisualize into one application. With the new system, one will simply give it a "normal" Chichewa phrase and it will do all the processing itself. ChicPOS will generate tagged text and pass it to ChiVisualize for visualization. ChicMorph, in turn, will produce extra morphological constraints that will be displayed when one emphasizes a certain phrase constituent in a given syntax tree. I will also add a transformational-generative grammar parser that will be able to match argument markers to their respective nominals if they are present in a given phrase. Of course, I am aware of the ambiguities that result from free word order and NPs from the same noun class, as in the following: Galimoto ng'ombe yayigunda (galimoto => car, ng'ombe => cow, yayigunda => 'has hit'). Which one hit the other here? (FYI: galimoto and ng'ombe fall in the same noun class.) One option is to leave it strictly non-configurational, so that the phrase is represented as S = NP + NP + V (or any permutation of these, without a VP). But I'll cross those bridges when I come to them :-).

          By the end of everything, I would like to build a head-driven phrase structure grammar checker for Chichewa. This will be useful not only in linguistics, but also in real-world applications like word processing software.