Data Preprocessing in Data Mining


Data preprocessing is an important process of data mining. In this process, raw data is converted into an understandable format and made ready for further analysis. The motive is to improve data quality and make it up to mark for specific tasks.

Tasks in Data Preprocessing

Data cleaning

Data cleaning help us remove inaccurate, incomplete and incorrect data from the dataset. Some techniques used in data cleaning are −

Handling missing values

This type of scenario occurs when some data is missing.

  • Standard values can be used to fill up the missing values in a manual way but only for a small dataset.

  • Attribute's mean and median values can be used to replace the missing values in normal and non-normal distribution of data respectively.

  • Tuples can be ignored if the dataset is quite large and many values are missing within a tuple.

  • Most appropriate value can be used while using regression or decision tree algorithms

Noisy Data

Noisy data are the data that cannot be interpreted by machine and are containing unnecessary faulty data. Some ways to handle them are −

  • Binning − This method handle noisy data to make it smooth. Data gets divided equally and stored in form of bins and then methods are applied to smoothing or completing the tasks. The methods are Smoothing by a bin mean method(bin values are replaced by mean values), Smoothing by bin median(bin values are replaced by median values) and Smoothing by bin boundary(minimum/maximum bin values are taken and replaced by closest boundary values).

  • Regression − Regression functions are used to smoothen the data. Regression can be linear(consists of one independent variable) or multiple(consists of multiple independent variables).

  • Clustering − It is used for grouping the similar data in clusters and is used for finding outliers.

Data integration

The process of combining data from multiple sources (databases, spreadsheets,text files) into a single dataset. Single and consistent view of data is created in this process. Major problems during data integration are Schema integration(Integrates set of data collected from various sources), Entity identification(identifying entities from different databases) and detecting and resolving data values concept.

Data transformation

In this part, change in format or structure of data in order to transform the data suitable for mining process. Methods for data transformation are −

Normalization − Method of scaling data to represent it in a specific smaller range( -1.0 to 1.0)

Discretization − It helps reduce the data size and make continuous data divide into intervals.

Attribute Selection − To help the mining process, new attributes are derived from the given attributes.

Concept Hierarchy Generation − In this, the attributes are changed from lower level to higher level in hierarchy.

Aggregation − In this, a summary of data gets stored which depends upon quality and quantity of data to make the result more optimal.

Data reduction

It helps in increasing storage efficiency and reducing data storage to make the analysis easier by producing almost the same results. Analysis becomes harder while working with huge amounts of data, so reduction is used to get rid of that.

Steps of data reduction are −

Data Compression

Data is compressed to make efficient analysis. Lossless compression is when there is no loss of data while compression. loss compression is when unnecessary information is removed during compression.

Numerosity Reduction

There is a reduction in volume of data i.e. only store model of data instead of whole data, which provides smaller representation of data without any loss of data.

Dimensionality reduction

In this, reduction of attributes or random variables are done so as to make the data set dimension low. Attributes are combined without losing its original characteristics.

Conclusion

This article comprises of data preprocessing which help data to get converted into usable format. Tasks which helps data preprocessing are Data cleaning, data integration, data transformation and data reduction. Data cleaning remove incomplete data by handling missing values and smoothing noises with the help of binning, regression and clustering. Data integration combines data from multiple source to make a single dataset. Data transformation help change the format of data by using discretization, attribute selection, concept hierarchy generation and aggregation to make the data usable for mining. Data reduction helps with reducing storage of data to make the analysis easier with the help of some steps like data compression, numerosity reduction and dimensionality reduction.

Updated on: 22-Aug-2023

9K+ Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements