What are the tasks in data preprocessing?

The major steps involved in data preprocessing are data cleaning, data integration, data reduction, and data transformation, described as follows −

Data Cleaning − Data cleaning routines work to “clean” the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If users believe the data are dirty, they are unlikely to trust the results of any data mining applied to them.

Moreover, dirty data can cause confusion during the mining step, resulting in unreliable output. Although many mining routines include some mechanism for dealing with incomplete or noisy data, these mechanisms are not always robust; they may instead be focused on avoiding overfitting the data to the function being modeled.
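A minimal sketch of two common cleaning routines, using a hypothetical column of sensor readings (the data and thresholds are assumptions for illustration): filling a missing value with the column median, then dropping outliers that fall outside the 1.5 × IQR fences.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with one missing value and one obvious outlier.
df = pd.DataFrame({"temp": [21.0, 22.5, np.nan, 23.0, 95.0, 22.0]})

# Fill the missing value with the column median.
df["temp"] = df["temp"].fillna(df["temp"].median())

# Drop rows whose value lies outside the 1.5 * IQR fences.
q1, q3 = df["temp"].quantile([0.25, 0.75])
iqr = q3 - q1
cleaned = df[df["temp"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```

The median is used rather than the mean because it is less sensitive to the very outliers the next step removes.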

Data Integration − Data integration is the process of merging data from several disparate sources. While performing data integration, one must deal with data redundancy, inconsistency, duplication, etc. In data mining, data integration is a preprocessing step that merges data from multiple heterogeneous sources into a coherent store, providing a unified view of the data.

Data integration is especially important in the healthcare industry. Integrating records from multiple patient databases and clinics helps clinicians recognize medical disorders and diseases by combining data from multiple systems into a single view from which useful insights can be derived.
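As a sketch of this idea, the snippet below merges two hypothetical clinic tables (the table and column names are assumptions) on a shared patient identifier; an outer join keeps patients who appear in either system, giving the unified view described above.

```python
import pandas as pd

# Hypothetical records from two separate clinic systems, keyed by patient_id.
labs = pd.DataFrame({"patient_id": [1, 2, 3], "glucose": [90, 140, 110]})
visits = pd.DataFrame({"patient_id": [2, 3, 4], "diagnosis": ["diabetes", "normal", "normal"]})

# Merge into one coherent table; the outer join retains every patient.
unified = labs.merge(visits, on="patient_id", how="outer")
```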

Data Reduction − The objective of data reduction is to represent the data more compactly. When the data size is smaller, it is simpler to apply sophisticated and computationally expensive algorithms. Data reduction can be performed on the number of rows (records) or the number of columns (dimensions).

In dimensionality reduction, data encoding schemes are applied to obtain a reduced or “compressed” representation of the original data. Examples include data compression methods (e.g., wavelet transforms and principal components analysis), attribute subset selection (e.g., removing irrelevant attributes), and attribute construction (e.g., deriving a small set of more useful attributes from the original set).
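A brief sketch of dimensionality reduction with principal components analysis, using synthetic data that is assumed for illustration: 100 records with 5 correlated attributes are compressed to 2 components that retain almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 5 attributes generated from 2 latent factors plus noise,
# so the data is effectively 2-dimensional.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(100, 5))

# Compress the 5 attributes into 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
```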

In numerosity reduction, the data are replaced by alternative, smaller representations using parametric models such as regression or log-linear models, or nonparametric models such as histograms, clusters, sampling, or data aggregation.
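Two of the nonparametric techniques above can be sketched as follows, on a hypothetical set of one million transaction amounts (the data is assumed for illustration): simple random sampling keeps a 1% subset, and a histogram replaces the raw values with 20 bucket counts.

```python
import numpy as np

# Hypothetical dataset: one million transaction amounts.
rng = np.random.default_rng(42)
amounts = rng.exponential(scale=50.0, size=1_000_000)

# Simple random sampling without replacement: keep a 1% sample.
sample = rng.choice(amounts, size=10_000, replace=False)

# Histogram: summarize the full data as 20 bucket counts and bin edges.
counts, edges = np.histogram(amounts, bins=20)
```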

Data Transformation − In data transformation, the data are transformed or consolidated into forms suitable for mining, for example by performing summary or aggregation operations. Data transformation includes −

Smoothing − Smoothing works to remove noise from the data. Such techniques include binning, regression, and clustering.
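A minimal sketch of smoothing by bin means, with sorted price values assumed for illustration: the values are partitioned into equal-frequency bins of depth 4, and each value is replaced by the mean of its bin.

```python
import pandas as pd

# Hypothetical sorted, noisy price values.
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)

# Partition into equal-frequency bins of depth 4 (bin ids 0, 1, 2),
# then replace each value with its bin's mean.
bins = prices.index // 4
smoothed = prices.groupby(bins).transform("mean")
```

Each group of four values collapses to a single level, which removes small fluctuations while preserving the overall trend.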

Aggregation − In aggregation, summary or aggregation operations are applied to the data. For instance, daily sales data can be aggregated to compute monthly and annual totals. This procedure is typically used in constructing a data cube for analysis of the data at multiple granularities.
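The daily-to-monthly example above can be sketched as follows, with hypothetical sales figures assumed for illustration:

```python
import pandas as pd

# Hypothetical daily sales records.
daily = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-17"]),
    "sales": [100.0, 150.0, 80.0, 120.0],
})

# Aggregate daily records into monthly totals.
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
```

The same pattern extends to annual totals by grouping on `dt.to_period("Y")` instead.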