HMM-Based POS Tagging System for 8 Different Languages and Several Tagsets

Abstract

In this paper, we propose a Part-Of-Speech (POS) tagging system based on the Hidden Markov Model (HMM) for several languages. The HMM tagger is implemented using the Viterbi algorithm for 8 languages: English, Hindi, Telugu, Bangla (Bengali), Marathi, Standard Chinese, Portuguese and Spanish. The data for these languages were taken from freely available corpora: the Brown, NPS-Chat, Indiana, Sinica, Floresta and CESS-ESP corpora. HMM is the most widely used learning method in many NLP applications, especially POS tagging. HMM taggers have been implemented by other researchers for many languages, with each researcher typically working on his or her own native language. The system is tested by splitting each corpus into 99% training data and 1% test data. This test is repeated 10 times, each time with different training and test data. The accuracies, averaged over all 10 tests, are as follows: English with tagsets of 40 and 472 tags, 95.3% and 92.39%; English (NPS corpus), 87.17%; Hindi, 81.3%; Telugu, 74.03%; Bangla (Bengali), 72.01%; Marathi, 69.56%; Standard Chinese, 87.59%; Portuguese with tagsets of 32 and 269 tags, 84.56% and 83.95%; and Spanish with tagsets of 14 and 289 tags, 94.26% and 92.08%. Several languages are used in order to record the limitations of applying a single method, the HMM tagger, to many different languages. The same corpus annotated with different tagsets is used to study the effect of tagset size. Two different corpora for the same language are also used. To our knowledge, no previous study has applied HMM to such a variety of cases. We provide an executable application that tags all words in any sentence for any of the 8 languages used in our work. Unknown words (words that do not occur in the training data) are handled with a simple method, Laplace smoothing.
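To make the approach concrete, the following is a minimal sketch of a bigram HMM tagger decoded with the Viterbi algorithm and using add-one (Laplace) smoothing. It is an illustration under stated assumptions, not the implementation evaluated in this work: the function names (`train`, `log_prob`, `viterbi`), the toy sentences, the `<s>` start symbol, the smoothing of transition counts, and the choice to reserve one extra vocabulary slot for unknown words are all hypothetical.

```python
import math
from collections import defaultdict

START = "<s>"  # assumed sentence-start pseudo-tag

def train(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(lambda: defaultdict(int))  # trans[prev_tag][tag]
    emit = defaultdict(lambda: defaultdict(int))   # emit[tag][word]
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab

def log_prob(counts, key, num_outcomes):
    """Add-one (Laplace) smoothed log-probability of `key`."""
    total = sum(counts.values())
    return math.log((counts.get(key, 0) + 1) / (total + num_outcomes))

def viterbi(words, model):
    """Return the most probable tag sequence for `words`."""
    trans, emit, tags, vocab = model
    if not words:
        return []
    n_tags = len(tags)
    n_words = len(vocab) + 1  # assumption: one extra slot for unknown words
    # best[i][t]: best log-score of any tag path ending in tag t at position i
    best = [{t: log_prob(trans[START], t, n_tags)
                + log_prob(emit[t], words[0], n_words) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Pick the predecessor tag that maximises the path score.
            prev, score = max(
                ((p, best[i - 1][p] + log_prob(trans[p], t, n_tags))
                 for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + log_prob(emit[t], words[i], n_words)
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Toy training data (hypothetical, not taken from any of the corpora above).
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train(toy_corpus)
print(viterbi(["the", "cat", "barks"], model))
print(viterbi(["the", "fox", "sleeps"], model))  # "fox" is an unknown word
```

Because every count is incremented by one before normalisation, a word that never occurred in training still receives a small non-zero emission probability under every tag, so the tag chosen for it is driven mainly by the transition probabilities of its neighbours.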