What is Data Cleaning?


Data cleaning is the process of cleaning data by filling in missing values, smoothing noisy data, identifying and removing outliers, and resolving inconsistencies. Sometimes the imported data is at a different level of detail from what is required. For example, an analysis may need the age ranges 20-30, 30-40, and 40-50, while the imported data contains only birth dates. Such data can be cleaned by deriving the required value and splitting it into the appropriate ranges.
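
The birth-date example above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names, the sample birth dates, and the reference date are all hypothetical, and the ranges are assumed to be half-open (an age of exactly 30 falls into 30-40), which the text does not specify.

```python
from datetime import date

def age_on(birth_date: date, today: date) -> int:
    """Whole years elapsed between birth_date and today."""
    years = today.year - birth_date.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

def age_range(age: int, width: int = 10) -> str:
    """Label the half-open bucket [low, low + width) that contains age."""
    low = (age // width) * width
    return f"{low}-{low + width}"

# Hypothetical inputs: an assumed "as of" date and three made-up birth dates.
reference_day = date(2021, 11, 19)
birth_dates = [date(1995, 3, 14), date(1988, 12, 1), date(1979, 11, 20)]

labels = [age_range(age_on(b, reference_day)) for b in birth_dates]
print(labels)
```

Fixing the reference date, rather than calling date.today(), keeps the derived ranges reproducible when the cleaning step is rerun later.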

Types of data cleaning

Data cleaning covers the following types of problems −

  • Missing values − Missing values are replaced with appropriate values or otherwise handled. The following approaches can be used.

    • The tuple is ignored when it includes several attributes with missing values.

    • The missing value is filled in manually.

    • The same global constant can replace every missing value.

    • The attribute mean can fill the missing values.

    • The most probable value can fill the missing values.

  • Noisy data − Noise is a random error or variance in a measured variable. The following smoothing methods can be used to handle noise −

    • Binning − Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it. The sorted values are distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.

    • Regression − Data can be smoothed by fitting it to a function, for example by regression. Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression in which more than two attributes are involved and the data are fit to a multidimensional surface.

    • Clustering − Clustering helps in identifying outliers. Similar values are organized into clusters, and values that fall outside the clusters are considered outliers.

    • Combined computer and human inspection − Outliers can also be identified through a combination of computer and human inspection. An outlier pattern may be informative or garbage. Patterns with a high surprise value can be output to a list for a human to review.

  • Inconsistent data − Inconsistencies can be introduced in recorded transactions, during data entry, or when integrating data from multiple databases. Some redundancies can be detected by correlation analysis. Careful integration of data from multiple sources can reduce or avoid redundancy.
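
Three of the missing-value approaches listed above (ignoring a tuple, filling with a global constant, and filling with the attribute mean) can be sketched as follows. The records, the helper names, the max_missing cutoff, and the constant -1 are assumptions made for illustration only.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"name": "A", "age": 25, "income": 40000},
    {"name": "B", "age": None, "income": 52000},
    {"name": "C", "age": 41, "income": None},
    {"name": "D", "age": None, "income": None},
]

def drop_sparse(rows, max_missing=1):
    """Ignore tuples that have more than max_missing missing attributes."""
    return [r for r in rows if sum(v is None for v in r.values()) <= max_missing]

def fill_constant(rows, attr, constant):
    """Replace every missing value of attr with the same global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_mean(rows, attr):
    """Replace every missing value of attr with the mean of the known values."""
    attr_mean = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: attr_mean if r[attr] is None else r[attr]} for r in rows]

kept = drop_sparse(rows)                         # row D has two missing values
with_age = fill_mean(kept, "age")                # mean of the known ages 25 and 41
cleaned = fill_constant(with_age, "income", -1)  # -1 is an arbitrary placeholder
print(cleaned)
```

Each helper returns new records instead of modifying the input, so the original data stays available for comparison.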
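
The binning method described under noisy data can be illustrated with smoothing by bin means, one common variant: the values are sorted, split into equal-size bins, and each value is replaced by the mean of its bin. The function name and the sample values are hypothetical, and equal-size (equal-frequency) bins are an assumption, since the text does not say how the bins are formed.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace each
    value with the mean of its bin (local smoothing)."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Hypothetical noisy measurements.
prices = [8, 21, 4, 15, 28, 21, 34, 25, 24]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
print(smoothed_prices)
```

Here the sorted values fall into the bins (4, 8, 15), (21, 21, 24), and (25, 28, 34), whose means are 9, 22, and 29. Note that the result follows the sorted order, not the original order of the input.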
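
Smoothing by regression, as described under noisy data, can be sketched with an ordinary least-squares fit of one attribute against another. The function name and the data points are made up for illustration; a real project would more likely use a statistics library than hand-written formulas.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept of the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy observations of two attributes.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
# Replacing each observed y with its fitted value smooths out the noise.
fitted = [slope * x + intercept for x in xs]
print(slope, intercept)
```

The fitted line can also predict the second attribute for a new value of the first, which is the forecasting use mentioned in the text.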
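
The use of correlation analysis to detect redundancy can be sketched with the Pearson correlation coefficient, one possible measure for numeric attributes. The attribute names, the sample values, and the 0.95 threshold are assumptions for illustration; the text does not prescribe a particular measure or cutoff.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)

# Hypothetical attributes merged from two sources: the same height recorded
# in centimetres and in inches, plus an unrelated weight attribute.
height_cm = [150.0, 160.0, 170.0, 180.0]
height_in = [h / 2.54 for h in height_cm]
weight_kg = [62.0, 55.0, 80.0, 70.0]

REDUNDANCY_THRESHOLD = 0.95  # assumed cutoff, not taken from the text

r_heights = pearson(height_cm, height_in)
r_weight = pearson(height_cm, weight_kg)
heights_redundant = abs(r_heights) >= REDUNDANCY_THRESHOLD
weight_redundant = abs(r_weight) >= REDUNDANCY_THRESHOLD
print(heights_redundant, weight_redundant)
```

A coefficient close to 1 or -1 suggests that one attribute can be derived from the other, so one of them is a candidate for removal after integration.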

Published on 19-Nov-2021 11:55:23