How to screen for outliers and deal with them?
Introduction
Data points that stand out from the bulk of other data points in a dataset are known as outliers. They can distort statistical measurements and obscure underlying trends in the data, which can have a detrimental effect on data analysis, modeling, and visualization. Therefore, before beginning any study, it is crucial to recognize and handle outliers.
In this post, we'll look at different methods for dealing with outliers as well as how to check for them.
Screening for Outliers
We must first recognize outliers in order to deal with them. Here are a few popular techniques for identifying outliers −
1. Visual Inspection
Plotting the data with box plots, scatter plots, or histograms is one way to find outliers: points that sit far from the bulk of the data stand out visually. Inspecting the plot, together with knowledge of how the data was collected, helps us judge whether an outlier is a genuine observation or the result of a mistake or corrupted data.
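The bar heights of a histogram can also be inspected numerically. The sketch below assumes NumPy and reuses the sample values from the example later in this article; the choice of five bins is arbitrary. Bins that are populated but separated from the main cluster by empty bins hint at outliers.

```python
import numpy as np

values = np.array([10, 9, 8, 7, 6, 555, 999, 5, 6])

# Count how many values fall into each of 5 equal-width bins,
# which is what a histogram plot would draw as bar heights.
counts, edges = np.histogram(values, bins=5)

# Print a rough text histogram, one row per bin.
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:7.1f} to {right:7.1f}: {'#' * count}")
```

Seven of the nine values land in the first bin, while the two large values sit alone in bins further to the right.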
2. Z-score
The z-score measures how many standard deviations a data point lies from the mean. Computing the z-score of every data point lets us find those that differ considerably from the rest. A data point whose absolute z-score is greater than 3 is frequently regarded as an outlier.
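As a minimal sketch of this rule, assuming NumPy and a hypothetical sample of 19 readings near 10 plus one extreme value, the z-score of each point can be computed directly from the mean and standard deviation:

```python
import numpy as np

# Hypothetical sample: 19 readings near 10 plus one extreme value.
values = np.array([10, 12, 11, 9, 10, 11, 10, 9, 12, 10,
                   11, 10, 9, 11, 10, 12, 10, 9, 11, 100])

# z = (x - mean) / standard deviation (population form, ddof=0)
z_scores = (values - values.mean()) / values.std()

# Flag points more than 3 standard deviations from the mean.
outliers = values[np.abs(z_scores) > 3]
print(outliers)
```

Only the value 100 is flagged here. The threshold of 3 is a convention, not a fixed rule, and it works best when the sample is reasonably large.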
3. Interquartile Range (IQR)
The interquartile range is the distance between the data's 25th percentile (Q1) and its 75th percentile (Q3), that is, IQR = Q3 − Q1. Multiplying the IQR by a factor of 1.5 gives a margin on either side of the quartiles. Any data point that lies more than 1.5 times the IQR below Q1 or above Q3 is frequently regarded as an outlier.
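The IQR rule can be sketched as follows, assuming NumPy and the sample values used in the example later in this article. The names lower_fence and upper_fence are illustrative, and np.percentile uses linear interpolation by default, so other quartile conventions may give slightly different fences.

```python
import numpy as np

values = np.array([10, 9, 8, 7, 6, 555, 999, 5, 6])

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, q3, outliers)
```

For this sample Q1 is 6 and Q3 is 10, so the fences are 0 and 16, and both 555 and 999 are flagged.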
Dealing with Outliers
After locating outliers, we must determine how to handle them. Here are a few typical methods for handling outliers −
1. Removal
Taking outliers out of the dataset is the easiest approach to handling them. This strategy should be employed with caution, though, as eliminating too many outliers might have a major negative influence on the dataset's statistical measurements and obscure key trends. It is crucial to record the procedure and the justification for deleting outliers when doing so.
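A minimal sketch of removal, assuming NumPy and using the IQR rule as the screening step (any other screening method could be substituted). The removed values are kept in a separate array so that the step can be documented.

```python
import numpy as np

values = np.array([10, 9, 8, 7, 6, 555, 999, 5, 6])

# Screen with the IQR rule
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

removed = values[is_outlier]   # keep a record of what was dropped
cleaned = values[~is_outlier]  # data used for further analysis

print(f"Removed {removed.size} of {values.size} values: {removed.tolist()}")
print("Mean before:", values.mean(), "after:", cleaned.mean())
```

Printing the removed values and the statistics before and after makes the effect of the removal visible and easy to report.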
2. Transformation
Another strategy is to transform the data with a mathematical function such as a logarithm, square root, or other power function. Transformations that compress large values reduce the influence of extreme values on statistical measures and can make patterns easier to spot.
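To illustrate the effect, the sketch below applies a logarithmic transformation with NumPy to the sample values used later in this article. The choice of np.log1p, which computes log(1 + x), is an assumption made here for convenience; it requires every value to be greater than -1.

```python
import numpy as np

values = np.array([10, 9, 8, 7, 6, 555, 999, 5, 6], dtype=float)

# log1p computes log(1 + x); all values here are positive.
log_values = np.log1p(values)

# Ratio of the largest value to the median, before and after.
raw_ratio = values.max() / np.median(values)
log_ratio = log_values.max() / np.median(log_values)

print("Raw scale:", raw_ratio)
print("Log scale:", log_ratio)
```

On the raw scale the largest value is more than a hundred times the median; on the log scale it is only a few times the median, so it dominates summary statistics far less.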
3. Imputation
Imputation is the process of substituting estimated values for missing or anomalous data. Data may be imputed using a variety of techniques, including mean imputation, median imputation, and regression imputation. Because this method can add bias to the dataset and affect the accuracy of the study, it should be used with caution.
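A minimal sketch of median imputation with NumPy follows. It assumes that the values 555 and 999 were already flagged by an earlier screening step, and it fills them with the median of the unflagged values; using the median of the full column, or the mean, would be equally valid design choices.

```python
import numpy as np

values = np.array([10, 9, 8, 7, 6, 555, 999, 5, 6], dtype=float)

# Assumption: 555 and 999 were already flagged by a screening step.
is_outlier = np.isin(values, [555, 999])

# Median of the unflagged values; their mean is the other common choice.
fill_value = np.median(values[~is_outlier])

# Replace flagged entries and leave the rest untouched.
imputed = np.where(is_outlier, fill_value, values)
print(imputed)
```

The dataset keeps its original length, which is the main practical difference from removal.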
4. Segmentation
The process of segmenting a dataset involves breaking it up into smaller groups according to various traits or properties. We may study each group independently and find patterns that are exclusive to each group by segmenting the data. When dealing with outliers that are valid but reflect a distinct portion of the data, this strategy may be helpful.
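The sketch below illustrates segmentation with pandas on a hypothetical dataset of order sizes from two customer segments; the column names and values are invented for illustration. Taken together, the wholesale orders look like outliers, but within their own segment they are unremarkable.

```python
import pandas as pd

# Hypothetical data: order sizes from two customer segments.
orders = pd.DataFrame({
    "segment": ["retail"] * 6 + ["wholesale"] * 3,
    "units": [5, 6, 7, 6, 8, 5, 500, 520, 510],
})

# Summary statistics computed separately for each segment.
summary = orders.groupby("segment")["units"].agg(["count", "mean", "median"])
print(summary)
```

Each segment can then be screened and modeled on its own scale instead of discarding one of them.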
Example
import pandas as pd
import numpy as np
from scipy import stats

# Create a sample dataset
data = pd.DataFrame({'value': [10, 9, 8, 7, 6, 555, 999, 5, 6]})

# Calculate the absolute z-score of each value in the column
z_scores = np.abs(stats.zscore(data['value']))

# Flag any value with an absolute z-score greater than 3
is_outlier = z_scores > 3
outliers = data[is_outlier]

# Replace flagged values with the median of the column
data.loc[is_outlier, 'value'] = data['value'].median()

# Print the dataset after replacement
print(data)
Output
   value
0     10
1      9
2      8
3      7
4      6
5    555
6    999
7      5
8      6
Explanation
A sample dataset is created with a single column named value that holds nine values, two of which (555 and 999) are far larger than the rest.
The z-score of each value is computed with the zscore function from SciPy's stats module. A data point's z-score indicates how many standard deviations it lies from the mean.
Because we are only concerned with the size of the departure from the mean and not its direction, the np.abs function is used to take the absolute value of each z-score.
The criterion z_scores > 3 is used to flag any value with an absolute z-score above 3 as an outlier.
Flagged values are replaced with the median of the value column, obtained with the median function.
Finally, the dataset is printed with the print function.
In this sample, however, no value is flagged, so the printed dataset is identical to the input. The two extreme values inflate both the mean and the standard deviation, which masks them: the absolute z-scores of 555 and 999 are only about 1.1 and 2.4. With only nine points, a z-score based on the population standard deviation cannot exceed √8 ≈ 2.83, so a threshold of 3 can never be reached. The z-score approach is better suited to large samples whose data are approximately normally distributed.
It's important to keep in mind that the approach used in this example is only one of several. Trimming, winsorizing, and using machine learning algorithms that are robust to outliers are other common techniques. The best approach depends on the characteristics of the dataset and the objectives of the analysis.
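As an illustration of winsorizing, the sketch below caps values at chosen percentiles with NumPy. Percentile-based capping with np.clip is one common way to implement the idea; the 5th and 95th percentile limits and the sample data are assumptions made for this example.

```python
import numpy as np

# Hypothetical sample: 19 readings near 10 plus one extreme value.
values = np.array([10, 12, 11, 9, 10, 11, 10, 9, 12, 10,
                   11, 10, 9, 11, 10, 12, 10, 9, 11, 100], dtype=float)

# Limits at the 5th and 95th percentiles (an arbitrary choice).
low, high = np.percentile(values, [5, 95])

# Pull values outside the limits back to the nearest limit.
winsorized = np.clip(values, low, high)

print("Limits:", low, high)
print("Max before:", values.max(), "after:", winsorized.max())
```

Unlike removal, this keeps every observation in the dataset while limiting how far any single one can pull the mean and standard deviation.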
Conclusion
In summary, outliers can negatively affect data analysis, modeling, and visualization, so it's critical to spot and handle them before beginning any study. We can make our analysis more accurate and insightful by checking for outliers using visual inspection, z-score, and IQR, and then dealing with them using removal, transformation, imputation, or segmentation. However, it's crucial to employ these methods carefully and to record the procedure.
