Ali Asghar Behmanesh & Abdol Hamid Pilevar

NLP Lab & Computer Engineering Dept., Bu Ali Sina University

Statistical part of speech tagger for Persian words

Abstract. Corpora tagged with part-of-speech (POS) information are often a prerequisite for more complex NLP applications such as information extraction, syntactic parsing, machine translation, and semantic field annotation. They are also used to train statistical models. This paper presents and evaluates a Maximum Likelihood Estimation (MLE) method for part-of-speech tagging of Persian texts. In the proposed methods, the MLE approach is used to handle unknown words. Three pre-processing techniques are implemented to improve the accuracy of the results. The experiments were conducted on a manually POS-tagged Persian corpus of over two million tagged words. The best accuracy achieved by the proposed MLE tagging methods was 96.07%, which compares satisfactorily with other known, similar methods.
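To illustrate the general idea of MLE tagging, the following is a minimal Python sketch, not the implementation evaluated in this paper. It assigns each known word the tag it carries most often in the training data, which is the maximum likelihood estimate of P(tag | word). The fallback for unknown words (the most frequent tag overall), the function names, the tag labels, and the toy transliterated corpus are all assumptions made for illustration. The abstract does not specify how unknown words are handled or what the three pre-processing techniques are, so neither is modelled here.

```python
from collections import Counter, defaultdict


def train_mle_tagger(tagged_sentences):
    """Build a word -> most frequent tag lexicon from (word, tag) pairs."""
    word_tag_counts = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_tag_counts[word][tag] += 1
            tag_counts[tag] += 1
    # MLE of P(tag | word): choose the tag with the highest relative frequency.
    lexicon = {w: c.most_common(1)[0][0] for w, c in word_tag_counts.items()}
    # Assumed fallback for unknown words: the most frequent tag overall.
    default_tag = tag_counts.most_common(1)[0][0]
    return lexicon, default_tag


def tag_words(words, lexicon, default_tag):
    """Tag each word from the lexicon, falling back to default_tag."""
    return [(w, lexicon.get(w, default_tag)) for w in words]


# Hypothetical toy corpus (transliterated tokens, made-up tag set).
toy_corpus = [
    [("ketab", "N"), ("khub", "ADJ"), ("ast", "V")],
    [("in", "DET"), ("ketab", "N"), ("ast", "V")],
    [("mard", "N"), ("ketab", "N"), ("did", "V")],
]

lexicon, default_tag = train_mle_tagger(toy_corpus)
tagged = tag_words(["ketab", "ast", "daneshgah"], lexicon, default_tag)
print(tagged)
```

In this sketch the unseen token "daneshgah" receives the fallback tag because it does not occur in the toy corpus. A real system would likely use richer evidence for unknown words than a single default tag.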