What is Data Transformation?

In data transformation, the data are transformed or combined into forms suitable for mining. Data transformation can involve the following −

Smoothing − Smoothing removes noise from the data. Techniques include binning, regression, and clustering.

Aggregation − Summary or aggregation operations are applied to the data. For example, daily sales data may be aggregated to compute monthly and annual totals. This step is typically used in constructing a data cube for analysis of the data at multiple granularities.

Generalization − Low-level or “primitive” (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For instance, categorical attributes, such as street, can be generalized to higher-level concepts, such as city or country. Similarly, values for numerical attributes, such as age, can be mapped to higher-level concepts, such as youth, middle-aged, and senior.

Normalization − The attribute data are scaled to fall within a small specified range, such as −1.0 to 1.0, or 0.0 to 1.0.

Attribute construction − New attributes are constructed from the given set of attributes and added to facilitate the mining process.
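Three of the operations listed above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names, the equal-frequency binning strategy, and the age cut-offs (30 and 60) are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def smooth_by_bin_means(values, bin_size):
    """Smoothing: sort the values, split them into equal-frequency bins,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


def aggregate_monthly(daily_sales):
    """Aggregation: roll (date, amount) records up to monthly totals,
    keyed by (year, month)."""
    totals = defaultdict(float)
    for day, amount in daily_sales:
        totals[(day.year, day.month)] += amount
    return dict(totals)


def generalize_age(age):
    """Generalization: map a numeric age to a higher-level concept.
    The cut-offs 30 and 60 are illustrative assumptions."""
    if age < 30:
        return "youth"
    if age < 60:
        return "middle-aged"
    return "senior"


# Made-up sample data for illustration.
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3))
print(aggregate_monthly([(date(2024, 1, 5), 100.0),
                         (date(2024, 1, 20), 50.0),
                         (date(2024, 2, 1), 25.0)]))
print(generalize_age(45))
```

In practice, the bin size and the concept hierarchy would come from domain knowledge rather than fixed constants.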

Smoothing is a form of data cleaning; in the data cleaning process, users specify transformations to correct data inconsistencies. Aggregation and generalization also serve as forms of data reduction. An attribute is normalized by scaling its values so that they fall within a small specified range, such as 0.0 to 1.0.

Normalization is especially helpful for classification algorithms involving neural networks, or distance measurements such as nearest-neighbor classification and clustering. When the neural network backpropagation algorithm is used for classification, normalizing the input values for each attribute measured in the training tuples helps speed up the learning phase.

For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes). Common methods for data normalization include the following −

Min-max normalization − It performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute, A. Min-max normalization maps a value, v, of A to v' in the range [new_min_A, new_max_A] by computing

$$v'=\frac{v-min_{A}}{max_{A}-min_{A}}(new\_max_{A}- new\_min_{A})+new\_min_{A}$$
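A minimal Python sketch of min-max normalization follows. The function name and the default target range of 0.0 to 1.0 are assumptions for illustration, and the income figures are made-up sample values.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min(values), max(values)]
    to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # The formula divides by (max_A - min_A), so a constant
        # attribute cannot be rescaled this way.
        raise ValueError("min-max normalization is undefined when all values are equal")
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]


incomes = [12000, 73600, 98000]  # made-up sample values
print(min_max_normalize(incomes))
```

Here the smallest income maps to 0.0, the largest to 1.0, and 73600 maps to roughly 0.716. Note that a new value outside the original range would fall outside the target range as well.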

Z-score normalization − In z-score normalization (or zero-mean normalization), the values for an attribute, A, are normalized based on the mean and standard deviation of A. A value, v, of A is normalized to v' by computing

$$v'=\frac{v-\bar{A}}{\sigma_{A}}$$

where Ā and σ_A are the mean and standard deviation, respectively, of attribute A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization.
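As a rough sketch, z-score normalization subtracts the attribute mean from each value and divides by the standard deviation. The function name below is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption; the sample standard deviation is an equally plausible choice.

```python
import statistics


def z_score_normalize(values):
    """Center values on their mean and scale by their standard deviation."""
    mean = statistics.mean(values)
    # Assumption: population standard deviation.
    std_dev = statistics.pstdev(values)
    if std_dev == 0:
        raise ValueError("z-score normalization is undefined when all values are equal")
    return [(v - mean) / std_dev for v in values]


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with mean 5 and std dev 2
print(z_score_normalize(scores))
```

The normalized values have mean 0 and standard deviation 1, so they are not confined to a fixed interval such as 0.0 to 1.0.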

Decimal scaling − Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value, v, of A is normalized to v' by computing

$$v'=\frac{v}{10^{j}}$$

where j is the smallest integer such that max(|v'|) < 1.
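Decimal scaling can be sketched as a search for the smallest power of ten that brings every scaled value below 1 in absolute value. The function name is hypothetical, and the sketch assumes j is non-negative, so values already below 1 in absolute value are left unchanged.

```python
def decimal_scale(values):
    """Divide every value by 10**j, where j is the smallest non-negative
    integer that makes the largest absolute scaled value less than 1.
    Returns the scaled values together with j."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


scaled, j = decimal_scale([-986, 917])  # made-up sample values
print(scaled, j)
```

For these values the maximum absolute value is 986, so j is 3 and each value is divided by 1000.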