What are the methods of Privacy-preserving data mining?

Privacy-preserving data mining is an area of data mining research that arose in response to privacy concerns in data mining. It is also known as privacy-enhanced or privacy-sensitive data mining. It deals with obtaining valid data mining results without disclosing the underlying sensitive data values.

Most privacy-preserving data mining approaches apply some form of transformation to the data to achieve privacy preservation. Generally, such methods reduce the granularity of representation to preserve privacy.

For instance, they can generalize the data from individual users to user groups. This reduction in granularity causes some loss of information, and possibly of the utility of the data mining results. This is the natural trade-off between information loss and privacy.

Privacy-preserving data mining methods can be classified into the following categories −

Randomization methods − These methods add noise to the data to mask some of the attribute values. The noise added should be sufficiently large so that individual data values, particularly the sensitive ones, cannot be recovered.

However, the noise must be added skillfully so that the final results of data mining are essentially preserved. Various techniques have been designed to derive aggregate distributions from the perturbed data.
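As an illustration, here is a minimal sketch of additive randomization in Python. The data, noise scale, and seed are hypothetical choices for the example: zero-mean noise masks each individual value, while an aggregate such as the mean remains approximately intact.

```python
import random
import statistics

def randomize(values, noise_scale=5.0, seed=42):
    """Mask each value by adding zero-mean uniform noise.

    Individual perturbed values no longer reveal the originals,
    but aggregates (e.g. the mean) are approximately preserved
    because the noise averages out to zero.
    """
    rng = random.Random(seed)
    return [v + rng.uniform(-noise_scale, noise_scale) for v in values]

# Hypothetical sensitive attribute: ages of individuals.
ages = [23, 35, 45, 52, 29, 41, 38, 60, 33, 47]
masked = randomize(ages)

# The published mean stays close to the true mean,
# even though no original age is released.
print(statistics.mean(ages), statistics.mean(masked))
```

Real randomization schemes go further, reconstructing the full aggregate distribution (not just the mean) from the perturbed values and the known noise distribution.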

The k-anonymity and l-diversity methods − Both of these methods alter individual records so that they cannot be uniquely identified. In the k-anonymity method, the granularity of data representation is reduced sufficiently so that any given record maps onto at least k other records in the data. It uses techniques such as generalization and suppression.

The k-anonymity method is weak in that, if there is homogeneity of sensitive values within a group, those values can be inferred for the altered records. The l-diversity model was designed to handle this weakness by enforcing intragroup diversity of sensitive values to ensure anonymization. The goal is to make it sufficiently difficult for adversaries to use combinations of record attributes to exactly identify individual records.
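The two checks can be sketched as follows; the records, quasi-identifiers (age and zip code), and generalization choices here are illustrative assumptions, not a standard dataset:

```python
from collections import Counter

def generalize_age(age, width=10):
    """Generalize an exact age into a coarse range such as '30-39'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def is_k_anonymous(records, k):
    """Every combination of quasi-identifiers must occur at least k times."""
    groups = Counter((r["age_range"], r["zip"]) for r in records)
    return all(count >= k for count in groups.values())

def is_l_diverse(records, l):
    """Every quasi-identifier group must contain at least l distinct
    sensitive values, guarding against the homogeneity weakness."""
    groups = {}
    for r in records:
        groups.setdefault((r["age_range"], r["zip"]), set()).add(r["disease"])
    return all(len(vals) >= l for vals in groups.values())

# Hypothetical raw records: (age, zip code, sensitive diagnosis).
raw = [(34, "47677", "flu"),    (36, "47602", "flu"),
       (33, "47678", "cancer"), (52, "47905", "flu"),
       (55, "47909", "cancer"), (58, "47906", "flu")]

# Generalize the age and suppress the last two zip digits.
records = [{"age_range": generalize_age(age),
            "zip": zip_code[:3] + "**",
            "disease": disease}
           for age, zip_code, disease in raw]

print(is_k_anonymous(records, 3), is_l_diverse(records, 2))
```

In this toy release the six records collapse into two groups of three, so 3-anonymity holds, and each group carries two distinct diagnoses, so 2-diversity holds as well.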

Distributed privacy preservation − Large data sets can be partitioned and distributed either horizontally (i.e., the data sets are partitioned into different subsets of records and distributed across multiple sites), vertically (i.e., the data sets are partitioned and distributed by their attributes), or in a combination of both.

Although the individual sites may not want to share their entire data sets, they may consent to limited information sharing with the use of various protocols. The overall effect of such methods is to preserve privacy for each individual object, while still deriving aggregate results over the entire data.
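One well-known protocol of this kind is a secure-sum ring over horizontally partitioned data: an initiating site adds a secret random mask, each site adds its own local total, and the initiator removes the mask at the end, so only the global sum is revealed and never any single site's total. A toy sketch, with hypothetical site totals and the message passing collapsed into one function:

```python
import random

def secure_sum(site_totals, modulus=10**6, seed=7):
    """Toy secure-sum ring protocol.

    The initiator picks a secret random value r; the running value
    passed from site to site is always masked by r, so no site learns
    another site's contribution. Arithmetic is mod `modulus`, which
    must exceed the largest possible global sum.
    """
    rng = random.Random(seed)
    r = rng.randrange(modulus)          # initiator's secret mask
    running = r
    for total in site_totals:           # each site adds its local total
        running = (running + total) % modulus
    return (running - r) % modulus      # initiator removes the mask

# Hypothetical local totals held by three hospitals.
print(secure_sum([120, 340, 95]))
```

Note this sketch trusts the sites not to collude; real secure multiparty computation protocols add further protections, but the masking idea is the same.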

Downgrading the effectiveness of data mining results − In some cases, even though the data itself may not be available, the output of data mining (e.g., association rules and classification models) can result in violations of privacy. The solution can be to downgrade the effectiveness of data mining by modifying either the data or the mining results, such as hiding some association rules or slightly distorting some classification models.
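For instance, rule hiding can be as simple as suppressing mined rules that mention sensitive items before the results are released. A hypothetical sketch (the mined rules and the sensitive-item set are illustrative, not output of a real miner):

```python
def sanitize_rules(rules, sensitive_items):
    """Suppress any mined association rule whose antecedent or
    consequent mentions a sensitive item, so that the released
    output cannot leak that association."""
    return [r for r in rules
            if not (set(r["lhs"]) | set(r["rhs"])) & sensitive_items]

# Hypothetical mined association rules with confidence values.
mined = [
    {"lhs": ("bread",),  "rhs": ("butter",),        "conf": 0.8},
    {"lhs": ("age>60",), "rhs": ("heart_disease",), "conf": 0.7},
    {"lhs": ("milk",),   "rhs": ("cereal",),        "conf": 0.6},
]

released = sanitize_rules(mined, {"heart_disease"})
print(len(released))  # the sensitive rule is withheld
```

In practice, simply deleting rules can itself leak information (the absence of a rule is noticeable), so real approaches often distort supports or confidences instead; this sketch shows only the basic filtering idea.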