An Effective Preprocessing Step Algorithm in Text Mining Application


Text mining is the process of extracting significant information from text documents. Any text mining system begins with a preprocessing step, which involves tokenization, stop-word removal, stemming, and finally building the term frequency-inverse document frequency (TF-IDF) matrix. These steps are the most time-consuming stage of knowledge discovery. The proposed method aims to build an effective preprocessing step that reduces both memory space and time requirements. It does so through an improved stop-word removal algorithm and an improved stemming algorithm based on the Porter stemming algorithm. The proposed method is tested at two levels: the first level uses only the vector space model with traditional stop-word removal and the traditional Porter stemmer, while the second level uses the vector space model combined with the improved stop-word removal and improved stemming algorithms. The results show that using the second level as the preprocessing step for text mining applications reduces the storage space used in memory by about 10% and shortens the processing time needed to build the final TF-IDF matrix.
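For orientation, the baseline pipeline named in the abstract (tokenization, stop-word removal, stemming, TF-IDF matrix) can be sketched as follows. This is a minimal illustration only, not the proposed improved algorithms: the stop-word list is a small hypothetical sample, `toy_stem` is a crude suffix stripper standing in for the Porter stemmer, and the weighting tf × log(N/df) is one common TF-IDF variant assumed here.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "in", "to", "from"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Crude suffix stripping; a stand-in for, NOT an implementation of, Porter."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters of the stem.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [toy_stem(t) for t in remove_stop_words(tokenize(text))]


def tfidf_matrix(documents):
    """Return (vocabulary, matrix) with one row per document.

    Weight used (an assumed variant): tf(t, d) * log(N / df(t)).
    """
    processed = [preprocess(d) for d in documents]
    vocab = sorted({t for doc in processed for t in doc})
    n_docs = len(processed)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in processed for t in set(doc))
    matrix = []
    for doc in processed:
        tf = Counter(doc)
        matrix.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, matrix


docs = ["Mining the text documents", "Text mining is mining"]
vocab, matrix = tfidf_matrix(docs)
```

Because every stop word or merged stem removes a column from the matrix, a more aggressive stop-word list or stemmer directly shrinks the vocabulary, which is the kind of memory saving the abstract reports.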