Privacy Preserving in Data Mining

Abstract

Privacy-preserving data mining (PPDM) is a recent research area in the field of data mining. It is defined as “protecting user’s information”. Protection of privacy has become an important issue in data mining research because of the increasing ability to store personal data about users and the development of data mining algorithms that can infer this information. The main goal of privacy-preserving data mining is to develop a system that modifies the original data in some way, so that private data and knowledge remain private even after the mining process. In this paper we propose a system that applies the PAM clustering algorithm to health datasets in order to generate a set of clusters. We then select one cluster, which is considered the sensitive cluster, to be hidden among the other clusters in order to increase the privacy of users' information. The sensitive cluster is protected by privacy techniques that modify the data values (attributes) in the dataset. We use two randomization techniques (Additive Noise and Data Swapping) together with Data Copying, a new technique proposed in this work, to prevent an attacker from inferring users' private information in the sensitive cluster. After modification, the same clustering algorithm is applied to the modified dataset to verify whether the selected cluster has been hidden. The techniques are applied to the Wisconsin breast cancer, diabetes and heart Statlog datasets. Experimental results show that the PAM algorithm clusters all of these datasets efficiently and that the selected cluster is protected effectively by the Additive Noise, Data Swapping and Data Copying techniques. The privacy ratio on the heart Statlog dataset was 48%, 52.1739% and 31.25% for the Data Copying, Additive Noise and Data Swapping techniques, respectively, because datasets of this kind have the special property that they are extremely sparse.
Experimental results also show that the Data Copying technique is faster than the existing techniques (Data Swapping and Additive Noise). Finally, the results show that the distortion of the data can be reduced as the privacy ratio increases. These are important issues in PPDM, and the proposed system is therefore highly successful in protecting privacy.
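
As an illustration of the two existing randomization techniques named above, the following is a minimal Python sketch, not the implementation used in this work. The function names, the `scale` and `seed` parameters, the Gaussian noise distribution and the toy data are all assumptions made for the example. Data Copying is omitted because its procedure is not described in this abstract.

```python
import random


def additive_noise(rows, row_idx, columns, scale=1.0, seed=0):
    """Return a copy of rows with zero-mean Gaussian noise added to the
    chosen columns of the chosen rows (hypothetical helper)."""
    rng = random.Random(seed)
    out = [list(r) for r in rows]  # copy so the original data is untouched
    for i in row_idx:
        for c in columns:
            out[i][c] += rng.gauss(0.0, scale)
    return out


def data_swapping(rows, row_idx, columns, seed=0):
    """Return a copy of rows in which, for each chosen column, the values
    are randomly exchanged among the chosen rows (hypothetical helper).
    The set of values in each column is preserved; only their assignment
    to records changes."""
    rng = random.Random(seed)
    out = [list(r) for r in rows]
    idx = list(row_idx)
    for c in columns:
        values = [rows[i][c] for i in idx]
        rng.shuffle(values)
        for i, v in zip(idx, values):
            out[i][c] = v
    return out


# Toy data (assumed): four records with two numeric attributes each.
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Assume records 0 and 1 form the sensitive cluster; perturb attribute 0.
noisy = additive_noise(rows, row_idx=[0, 1], columns=[0], scale=0.5, seed=1)

# Swap attribute 1 among all records.
swapped = data_swapping(rows, row_idx=[0, 1, 2, 3], columns=[1], seed=1)
```

In the pipeline described above, the perturbed records would then be clustered again with the same algorithm to check whether the sensitive cluster can still be recovered.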