Improving Machine Learning Performance by Eliminating the Influence of Unclean Data


Regardless of source and type (text, numeric, image collections, etc.), real-world data are usually unclean: they contain errors and inconsistencies that can strongly affect machine learning. Several factors influence the results of a learning algorithm on a given task, but the nature and characteristics of the input data are the most important determinant of its success. This paper examines how data are processed before being fed to a learning algorithm and explains the operations performed at each preprocessing stage to get the best results from a data set. Four machine learning models are used, including SVM, Multinomial Naive Bayes (Multinomial-NB), and Bernoulli Naive Bayes (Bernoulli-NB). The best initial accuracy, 89%, is obtained by the Bernoulli-NB model. The preprocessing algorithm applied to the dirty data set is then developed further, and the results are compared with those obtained before the development. After this development, the Bernoulli-NB model reaches 91% accuracy, and the accuracy of the other models also improves.
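The abstract does not list the individual cleaning operations, so the following is only a minimal sketch of what a preprocessing stage for dirty text data might look like, not the paper's actual algorithm. It assumes the data set is a collection of (text, label) pairs. The function names `clean_text` and `clean_dataset` and the chosen steps (lowercasing, removing URLs and markup tags, stripping punctuation, dropping empty and duplicate records) are hypothetical illustrations.

```python
import re
import string

# Hypothetical cleaning rules; the paper's own steps may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_text(raw):
    """Normalize one raw text record into lowercase, space-separated tokens."""
    text = raw.lower()
    text = URL_RE.sub(" ", text)         # remove URLs before punctuation is stripped
    text = TAG_RE.sub(" ", text)         # remove leftover markup tags
    text = text.translate(PUNCT_TABLE)   # replace punctuation with spaces
    return WS_RE.sub(" ", text).strip()  # collapse runs of whitespace


def clean_dataset(rows):
    """Clean (text, label) pairs, dropping missing, empty, and duplicate texts."""
    seen = set()
    cleaned = []
    for text, label in rows:
        if text is None:
            continue                     # missing value
        norm = clean_text(text)
        if not norm or norm in seen:
            continue                     # empty after cleaning, or a duplicate
        seen.add(norm)
        cleaned.append((norm, label))
    return cleaned
```

Under these assumptions, the cleaned pairs would then be vectorized and passed to the classifiers, and the same models could be trained on the raw and the cleaned data to compare accuracy before and after preprocessing.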