Ways to Detect Anomalies in a Given Dataset


Introduction

Anomalies are values or observations that differ markedly from the other observations in a dataset. Detecting and handling anomalies is essential when building a machine learning model, because the data passed to the model must be of good enough quality to rely on. It is commonly held that a high-quality dataset can yield accurate and reliable results even with a weak algorithm, whereas if the dataset itself is of poor quality, there is very little chance of achieving a high-performing model.

This article discusses outliers, the core idea behind them, why we should detect them, and ways to detect them. It aims to help the reader understand the concept of outliers, their role in model building, and how to detect and treat them.

What are Anomalies?

Anomalies, or outliers, are the data points in a given dataset that do not fit with the other observations. They are values or observations that are very high, very low, or otherwise very different from the rest of the data.

Outliers can strongly affect the performance of a model, and hence they should be detected and treated carefully. Data cleaning and preprocessing play a significant role in building an accurate and reliable model, and outlier detection and removal is one of the most complex and essential stages of that work. It is also a risky task, because at this stage we are singling out the values that do not fit the normal data, and sometimes those values carry information that the normal data cannot provide. Detecting and treating outliers therefore requires both technical and domain expertise.

Anomalies Detection: Methods

There are mainly two methods for the detection and removal of the outliers −

Trimming
Capping

Trimming is a method where we decide an upper and a lower limit for the dataset and remove, or exclude, the observations that fall outside them. It is one of the fastest techniques for detecting and removing outliers.

Capping, as the name suggests, is a method where we cap the data at fixed limits. Here too an upper and a lower limit are decided, but instead of being removed, values beyond a limit are replaced with that limit.
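
The difference between the two methods can be sketched on a toy example. The data and the limits of 0 and 50 below are made up for illustration and are not taken from any real dataset; in practice the limits would come from a rule such as the ones described in the following sections.

```python
import pandas as pd

# Hypothetical toy data; the limits 0 and 50 are picked by hand for illustration
s = pd.Series([12, 15, 14, 13, 90, 16, -40])
lower_limit, upper_limit = 0, 50

# Trimming: drop the observations that fall outside the limits
trimmed = s[(s >= lower_limit) & (s <= upper_limit)]

# Capping: keep every observation, but pull extreme values back to the limits
capped = s.clip(lower=lower_limit, upper=upper_limit)

print(trimmed.tolist())  # [12, 15, 14, 13, 16]
print(capped.tolist())   # [12, 15, 14, 13, 50, 16, 0]
```

Trimming shrinks the dataset, while capping keeps its size and only changes the extreme values.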

Anomalies Detection: Z Score

The Z score is one of the oldest and most reliable methods for detecting outliers. It is a statistical method: every numerical feature has some distribution, with a mean and a spread, which we can plot easily using different Python libraries. The Z score measures how many standard deviations an observation lies from the mean, and observations with a Z score greater than 3 or less than -3 are treated as outliers.

The formula for the Z score is −

Z = (Xi - Mean(X)) / StdDev(X)

Z = Z score

Xi = Data Observation

Mean(X) = Mean of X

StdDev(X) = Standard Deviation of X

Example

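# Z score of every observation in column 'x'
df['zscore'] = (df['x'] - df['x'].mean()) / df['x'].std()
# Rows flagged as outliers (Z score above 3 or below -3)
df[(df['zscore'] > 3) | (df['zscore'] < -3)]
# Trim: keep only the rows within three standard deviations of the mean
new_df = df[(df['zscore'] < 3) & (df['zscore'] > -3)]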
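
The Z score filter can also be tried end to end on a small, made-up dataset. The sketch below assumes a hypothetical column `x` holding twenty typical values and one extreme value; the names `outliers` and `new_df` are only illustrative.

```python
import pandas as pd

# Hypothetical data: twenty typical values plus one extreme observation
df = pd.DataFrame({'x': [48, 52] * 10 + [200]})

# Z score using the sample standard deviation (the pandas default)
df['zscore'] = (df['x'] - df['x'].mean()) / df['x'].std()

# Observations more than three standard deviations from the mean
outliers = df[df['zscore'].abs() > 3]
new_df = df[df['zscore'].abs() <= 3]

print(outliers['x'].tolist())  # [200]
print(len(df), len(new_df))    # 21 20
```

One caveat worth keeping in mind: with the sample standard deviation, no Z score can exceed (n - 1) / sqrt(n) in absolute value, so in a dataset of ten or fewer observations the threshold of 3 can never be reached.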

Anomalies Detection: Capping

Capping is also one of the most used methods for detecting and treating outliers. Here we use the mean and standard deviation of the data to detect the outliers and cap them. In this approach, we calculate the mean and standard deviation of the data and derive an upper and a lower limit from them. These limits act as thresholds: observations with values higher than the upper limit or lower than the lower limit are considered outliers.

Here the upper and lower limit of the data is calculated using the below equations −

Upper Limit = Mean(X) + 3*(StdDev(X))

Lower Limit = Mean(X) - 3*(StdDev(X))

Example

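import numpy as np

# Limits at three standard deviations on either side of the mean
upper_limit = df['x'].mean() + 3*df['x'].std()
lower_limit = df['x'].mean() - 3*df['x'].std()
# Replace values beyond a limit with that limit; leave the rest unchanged
df['x'] = np.where(
    df['x']>upper_limit,
    upper_limit,
    np.where(
        df['x']<lower_limit,
        lower_limit,
        df['x']
    )
)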

We can see in the above code that the mean and standard deviation of the data decide the upper and lower limits. Observations with values higher than the upper limit are replaced with the upper limit, and observations with values lower than the lower limit are replaced with the lower limit.
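
As a possible alternative to the nested `np.where` call, pandas offers `Series.clip`, which applies both limits in one step. The sketch below uses a made-up column `x` and checks that the two approaches give the same result; it is an illustration under those assumptions rather than the only way to cap data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: twenty typical values plus one extreme observation
df = pd.DataFrame({'x': [48.0, 52.0] * 10 + [200.0]})

upper_limit = df['x'].mean() + 3 * df['x'].std()
lower_limit = df['x'].mean() - 3 * df['x'].std()

# Capping with nested np.where
capped_where = np.where(
    df['x'] > upper_limit,
    upper_limit,
    np.where(df['x'] < lower_limit, lower_limit, df['x']),
)

# Capping with Series.clip
capped_clip = df['x'].clip(lower=lower_limit, upper=upper_limit)

print(np.allclose(capped_where, capped_clip))  # True
print(len(capped_clip) == len(df))             # True
```

In both cases the extreme value of 200 is pulled down to the upper limit and the number of rows stays the same.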

Anomalies Detection: IQR Method

The interquartile range (IQR) method is also used to detect outliers, particularly when the data distribution is skewed. In this case, the upper and lower limits are decided based on the interquartile range of the data. Observations with values higher than Q3 + 1.5 * IQR or lower than Q1 - 1.5 * IQR are considered outliers.

Upper Limit = Q3 + 1.5 * IQR

Lower Limit = Q1 - 1.5 * IQR

Here IQR = Interquartile range = Q3 - Q1

Q3 = 75th Percentile of the dataset

Q1 = 25th Percentile of the dataset

Example

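# Quartiles and interquartile range of the column
percentile25 = df['placement_exam_marks'].quantile(0.25)
percentile75 = df['placement_exam_marks'].quantile(0.75)
iqr = percentile75 - percentile25
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
# Rows flagged as outliers
df[df['placement_exam_marks'] > upper_limit]
df[df['placement_exam_marks'] < lower_limit]
# Trim: keep only the rows that lie within both limits
new_df = df[(df['placement_exam_marks'] >= lower_limit) & (df['placement_exam_marks'] <= upper_limit)]
new_df.shape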
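
To see the numbers involved, the IQR rule can be run on a small, made-up set of marks. The values below are hypothetical, and the short names `q1`, `q3`, and `in_range` are introduced only for this sketch.

```python
import pandas as pd

# Hypothetical marks: eight typical values and one extreme value
df = pd.DataFrame({'placement_exam_marks': [10, 12, 14, 15, 16, 18, 20, 22, 95]})

q1 = df['placement_exam_marks'].quantile(0.25)  # 14.0
q3 = df['placement_exam_marks'].quantile(0.75)  # 20.0
iqr = q3 - q1                                   # 6.0

upper_limit = q3 + 1.5 * iqr
lower_limit = q1 - 1.5 * iqr

# between() is inclusive on both ends by default
in_range = df['placement_exam_marks'].between(lower_limit, upper_limit)
outliers = df[~in_range]
new_df = df[in_range]

print(lower_limit, upper_limit)                   # 5.0 29.0
print(outliers['placement_exam_marks'].tolist())  # [95]
print(new_df.shape)                               # (8, 1)
```

Because the limits depend only on the quartiles, the extreme value of 95 does not stretch them the way it would stretch a mean and standard deviation.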

Key Takeaways

Outliers are data observations with very high or very low values compared with the other observations in the dataset.
Outliers are one of the most critical issues to deal with while cleaning and preprocessing the data.
Outliers should be detected and handled carefully to avoid poor-performing models.
We can calculate the Z score of the data and then classify observations with a Z score greater than 3 or less than -3 as outliers.
We can also use the capping method, where we cap the data using upper and lower limits based on the standard deviation and mean of the data.
The interquartile range method can also be used when we have skewed data. Here the upper and lower limits are decided based on the IQR and the 25th and 75th percentiles of the data.

Conclusion

In this article, we discussed what outliers are, why we should detect them, and how we can treat them. We also discussed three methods for detecting and handling outliers: the Z score method, the capping method, and the interquartile range method. This will help one understand the concept of outliers better and deal with them in practice.

Parth Shukla

Updated on: 24-Feb-2023
