What is Data Cleaning?
Data cleaning is the process of cleaning data by filling in missing values, smoothing noisy data, identifying and removing outliers, and resolving inconsistencies. Sometimes the data is at a different level of detail from what is required. For example, an analysis may need the age ranges 20-30, 30-40, and 40-50, while the imported data contains only birth dates. Such data can be cleaned by converting it into the required categories.
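The birth-date example can be sketched in Python as follows. This is a minimal illustration, not a prescribed method: the function names `age_on` and `age_range`, the fixed reference date, and the sample birth dates are all assumptions made for the example. It also assumes half-open ranges, so an age of exactly 30 falls in 30-40 rather than 20-30.

```python
from datetime import date

def age_on(birth_date, today):
    """Return the age in whole years on the given reference date."""
    years = today.year - birth_date.year
    # Subtract one year if the birthday has not occurred yet this year.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

def age_range(age, low=20, high=50, width=10):
    """Map an age to a label such as '20-30'; None if outside [low, high)."""
    if age < low or age >= high:
        return None
    start = low + ((age - low) // width) * width
    return f"{start}-{start + width}"

# Hypothetical sample data and a fixed reference date, for illustration only.
reference = date(2024, 1, 1)
birth_dates = [date(1999, 6, 15), date(1990, 1, 1), date(1978, 12, 31)]
ranges = [age_range(age_on(b, reference)) for b in birth_dates]
print(ranges)
```

Ages outside the three ranges map to `None` here, which leaves the decision about such records to a later step.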
Types of data cleaning
The main types of data cleaning are as follows −
Missing Values − Missing values are filled in with appropriate values. The following approaches can be used.
The tuple is ignored when several of its attributes have missing values.
The missing value is filled in manually.
A single global constant replaces every missing value.
The attribute mean replaces the missing values.
The most probable value replaces the missing values.
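The approaches above, other than manual entry, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the records, field names, and helper functions are hypothetical, and the mode (most frequent value) is used only as a simple stand-in for the "most probable value", which in practice may come from a predictive model.

```python
from statistics import mean, mode

# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "city": "Pune"},
    {"age": None, "city": "Delhi"},
    {"age": 35, "city": None},
    {"age": None, "city": None},
    {"age": 30, "city": "Pune"},
]

def drop_sparse(rows, max_missing=1):
    """Ignore tuples that have more than max_missing missing attributes."""
    return [r for r in rows if sum(v is None for v in r.values()) <= max_missing]

def fill_constant(rows, field, constant):
    """Replace every missing value of a field with one global constant."""
    return [{**r, field: constant if r[field] is None else r[field]} for r in rows]

def fill_mean(rows, field):
    """Replace missing values with the mean of the known values."""
    known = [r[field] for r in rows if r[field] is not None]
    return fill_constant(rows, field, mean(known))

def fill_mode(rows, field):
    """Replace missing values with the most frequent known value
    (a simple stand-in for the most probable value)."""
    known = [r[field] for r in rows if r[field] is not None]
    return fill_constant(rows, field, mode(known))

print(len(drop_sparse(records)))
print([r["age"] for r in fill_mean(records, "age")])
print([r["city"] for r in fill_mode(records, "city")])
```

Each helper returns new records rather than changing the input, so the strategies can be compared on the same data.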
Noisy data − Noise is a random error or variance in a measured variable. The following smoothing methods can be used to handle noise −
Binning − Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it. The sorted values are distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.
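One common variant, smoothing by bin means, can be sketched as follows. This is an illustrative sketch only: the function name `smooth_by_bin_means`, the equal-size bins, and the sample values are assumptions made for the example. Each value is replaced by the mean of the bin it falls into.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size items,
    and replace each value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Hypothetical sample values, for illustration only.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, 3)
print(smoothed)
```

A larger `bin_size` smooths more aggressively, since each value is averaged with more of its neighbors.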
Regression − Data can be smoothed by fitting it to a function, as in regression. Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression in which more than two attributes are involved and the data are fit to a multidimensional surface.
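Smoothing with a fitted line can be sketched as follows. This minimal example uses ordinary least squares for two attributes; the function name `fit_line` and the sample points are assumptions made for illustration. The noisy `ys` values are replaced by the values the fitted line predicts.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx  # assumes the xs are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy measurements that roughly follow a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
slope, intercept = fit_line(xs, ys)

# Smoothed values: what the fitted line predicts for each x.
smoothed = [slope * x + intercept for x in xs]
print(round(slope, 2), round(intercept, 2))
```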
Clustering − Clustering helps to identify outliers. Similar values are organized into clusters, and values that fall outside the clusters are treated as outliers.
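The idea can be sketched with a deliberately simple one-dimensional clustering. This is an assumption-laden illustration, not a standard algorithm from a library: the helper names, the gap threshold, and the minimum cluster size are hypothetical. Sorted values are grouped whenever neighbors lie within `max_gap` of each other, and values in clusters smaller than `min_cluster_size` are reported as outliers.

```python
def gap_clusters(values, max_gap):
    """Group sorted values into clusters; start a new cluster whenever
    the gap to the previous value exceeds max_gap."""
    if not values:
        return []
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def find_outliers(values, max_gap, min_cluster_size):
    """Return values that belong to clusters smaller than min_cluster_size."""
    return [
        v
        for cluster in gap_clusters(values, max_gap)
        if len(cluster) < min_cluster_size
        for v in cluster
    ]

# Hypothetical sensor readings, for illustration only.
readings = [10, 11, 12, 50, 51, 52, 53, 95]
print(gap_clusters(readings, 3))
print(find_outliers(readings, 3, 2))
```

Both thresholds depend on the data, so in practice they would need to be chosen with some knowledge of the attribute being cleaned.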
Combined computer and human inspection − Outliers can also be identified through a combination of computer and human inspection. An outlier pattern may be informative or may simply be garbage. Patterns with a high surprise value can be output to a list for a human to review.
Inconsistent data − Inconsistencies can arise in recorded transactions, during data entry, or from integrating information from multiple databases. Some redundancies can be detected by correlation analysis. Careful integration of the data from the various sources can reduce or avoid redundancy.
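Correlation analysis for redundancy can be sketched as follows. This is a minimal sketch using the Pearson correlation coefficient; the attribute names, sample values, and the 0.95 threshold are assumptions made for illustration. Two numeric attributes whose correlation is close to 1 or -1 are flagged as possibly redundant.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return sxy / sqrt(sxx * syy)

def redundant_pairs(attributes, threshold=0.95):
    """Return attribute name pairs whose absolute correlation meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(attributes, 2)
        if abs(pearson(attributes[a], attributes[b])) >= threshold
    ]

# Hypothetical attributes merged from two sources: the same height
# recorded in centimetres and in inches, plus an unrelated attribute.
heights_cm = [150, 160, 170, 180, 190]
attributes = {
    "height_cm": heights_cm,
    "height_in": [h / 2.54 for h in heights_cm],
    "shoe_size": [9, 6, 10, 7, 8],
}
print(redundant_pairs(attributes))
```

A flagged pair is only a candidate: high correlation suggests one attribute may be derivable from the other, and a person would normally confirm that before dropping either one.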
