What is a distance-based outlier?

An object o in a data set S is a distance-based (DB) outlier with parameters p and d, written DB(p, d), if at least a fraction p of the objects in S lie at a distance greater than d from o. In other words, instead of relying on statistical tests, we can think of distance-based outliers as those objects that do not have enough neighbors.
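The definition can be checked directly with a brute-force scan. The following is a minimal sketch, assuming Euclidean distance and assuming that o itself is counted among the objects of S; the function name `is_db_outlier` and the sample points are made up for illustration.

```python
import math

def is_db_outlier(o, data, p, d):
    """Return True if at least a fraction p of `data` lies farther than d from o."""
    far = sum(1 for x in data if math.dist(o, x) > d)
    # Small tolerance guards against floating-point error in p * len(data).
    return far >= p * len(data) - 1e-9

# Four clustered points and one isolated point (illustrative data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
outliers = [o for o in points if is_db_outlier(o, points, p=0.8, d=1.0)]
print(outliers)  # [(5.0, 5.0)]
```

This scan compares every object with every other object, which is the quadratic cost that the more efficient algorithms try to reduce.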

Neighbors are defined in terms of distance from the given object. Compared with statistical-based methods, distance-based outlier detection generalizes and unifies the ideas behind discordancy testing for standard distributions. Hence, a distance-based outlier is also known as a unified outlier or UO-outlier.

Distance-based outlier detection avoids the excessive computation that can be associated with fitting the observed distribution to some standard distribution and with choosing discordancy tests. For some discordancy tests, it can be shown that if an object o is an outlier according to the given test, then o is also a DB(p, d) outlier for some suitably defined p and d.
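As a rough numeric illustration of how p and d can be matched to a discordancy test, the sketch below takes a normal distribution and an object sitting exactly 3 standard deviations from the mean, then computes the fraction of the distribution that lies farther than 0.13 standard deviations from that object. The helper names `normal_cdf` and `fraction_farther_than` are hypothetical, and the calculation works on the continuous distribution rather than on a finite sample.

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def fraction_farther_than(t, r):
    """Fraction of a normal population lying farther than r standard deviations
    from an object located t standard deviations above the mean."""
    within = normal_cdf(t + r) - normal_cdf(t - r)
    return 1.0 - within

frac = fraction_farther_than(t=3.0, r=0.13)
print(round(frac, 4))  # 0.9988
```

Objects farther than 3 standard deviations from the mean sit in an even sparser region, so the fraction of the population beyond the same radius is at least this large for them.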

For instance, if objects that lie 3 or more standard deviations from the mean are considered to be outliers, assuming a normal distribution, then this definition can be “unified” by a DB(0.9988, 0.13σ) outlier, where σ is the standard deviation. Several efficient algorithms for mining distance-based outliers have been developed, as follows −

Index-based algorithm − Given a data set, the index-based algorithm uses multidimensional indexing structures, such as R-trees or k-d trees, to search for neighbors of each object o within radius d around that object. Let M be the maximum number of objects within the d-neighborhood of an outlier. Then, once M + 1 neighbors of object o are found, it is clear that o is not an outlier. This algorithm has a worst-case complexity of $O(k \cdot n^2)$, where k is the dimensionality and n is the number of objects in the data set.
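The early-termination idea can be sketched with a small hand-written k-d tree. This is an illustrative sketch, not a production index: the tree layout, the function names, and the convention that an object counts as its own neighbor are all assumptions, and M is taken as the largest neighbor count that still leaves a fraction p of the objects farther than d.

```python
import math

def build_kdtree(points, depth=0):
    """Build a simple k-d tree as nested tuples (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda pt: pt[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def count_neighbors(node, o, d, limit):
    """Count points within distance d of o, stopping once the count exceeds limit."""
    if node is None:
        return 0
    point, axis, left, right = node
    count = 1 if math.dist(point, o) <= d else 0
    diff = o[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    # The far side can only hold neighbors if the splitting plane is within d.
    for child, needed in ((near, True), (far, abs(diff) <= d)):
        if needed and count <= limit:
            count += count_neighbors(child, o, d, limit - count)
    return count

def index_based_outliers(data, p, d):
    n = len(data)
    m = n - math.ceil(p * n - 1e-9)  # M: most neighbors an outlier may have
    tree = build_kdtree(list(data))
    # An object is ruled out as soon as M + 1 neighbors have been found.
    return [o for o in data if count_neighbors(tree, o, d, m) <= m]

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(index_based_outliers(points, p=0.8, d=1.0))  # [(5.0, 5.0)]
```

The search for a clustered object stops after its second neighbor here, since M is 1 for this data; only the isolated object requires the range search to run to completion.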

Nested-loop algorithm − The nested-loop algorithm has the same computational complexity as the index-based algorithm but avoids index structure construction and tries to minimize the number of I/Os. It divides the memory buffer space into two halves and the data set into several logical blocks. By choosing the order in which blocks are loaded into each half, the number of block reads can be kept low.
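The block structure can be sketched in memory as follows. This is a simplified illustration under several assumptions: the blocks are plain Python list slices rather than disk pages, `block_size` is an invented parameter, an object counts as its own neighbor, and the block-ordering strategy that saves I/O is not modelled.

```python
import math

def nested_loop_outliers(data, p, d, block_size=2):
    n = len(data)
    m = n - math.ceil(p * n - 1e-9)  # M: most neighbors an outlier may have
    blocks = [data[i:i + block_size] for i in range(0, n, block_size)]
    outliers = []
    for first in blocks:                  # block held in the first buffer half
        counts = [0] * len(first)
        for second in blocks:             # blocks streamed through the second half
            for i, o in enumerate(first):
                if counts[i] > m:
                    continue              # o already has M + 1 neighbors
                for x in second:
                    if math.dist(o, x) <= d:
                        counts[i] += 1
                        if counts[i] > m:
                            break
        outliers.extend(o for o, c in zip(first, counts) if c <= m)
    return outliers

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(nested_loop_outliers(points, p=0.8, d=1.0))  # [(5.0, 5.0)]
```

No index is built; the cost lies in the pairwise distance checks, which is why the overall complexity matches the index-based approach.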

Cell-based algorithm − To avoid $O(n^2)$ computational complexity, a cell-based algorithm was developed for memory-resident data sets. Its complexity is $O(c^k + n)$, where c is a constant depending on the number of cells and k is the dimensionality.

In this method, the data space is partitioned into cells with a side length equal to $\frac{d}{2\sqrt{k}}$. Each cell has two layers surrounding it.

The first layer is one cell thick, while the second is $\lceil 2\sqrt{k} - 1 \rceil$ cells thick, that is, $2\sqrt{k} - 1$ rounded up to the nearest integer. The algorithm counts outliers on a cell-by-cell rather than an object-by-object basis. For a given cell, it accumulates three counts: the number of objects in the cell, in the cell and the first layer together, and in the cell and both layers together.
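The three counts can be put to work as in the sketch below. It assumes a cell side length of d / (2√k) and a second-layer thickness of ⌈2√k − 1⌉ cells, and it applies the pruning rules usually given for this method: if the cell plus first layer holds more than M objects, no object in the cell is an outlier; if the cell plus both layers holds at most M objects, every object in the cell is an outlier; otherwise the objects are checked one by one against the second layer. The function names, the dictionary-based grid, and counting an object as its own neighbor are assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import product

def cell_based_outliers(data, p, d):
    n, k = len(data), len(data[0])
    m = n - math.ceil(p * n - 1e-9)          # M: most neighbors an outlier may have
    side = d / (2 * math.sqrt(k))            # cell side length
    thick = math.ceil(2 * math.sqrt(k) - 1)  # second-layer thickness in cells

    cells = defaultdict(list)
    for x in data:
        cells[tuple(math.floor(c / side) for c in x)].append(x)

    def layer_points(cell, lo, hi):
        """Objects in cells whose grid (Chebyshev) distance from `cell` is in [lo, hi]."""
        found = []
        for off in product(range(-hi, hi + 1), repeat=k):
            if lo <= max(abs(v) for v in off) <= hi:
                key = tuple(c + v for c, v in zip(cell, off))
                found.extend(cells.get(key, ()))
        return found

    outliers = []
    for cell, members in cells.items():
        # Everything in the cell and first layer is within d of every member.
        n_cell_l1 = len(members) + len(layer_points(cell, 1, 1))
        if n_cell_l1 > m:
            continue                          # no member can be an outlier
        layer2 = layer_points(cell, 2, 1 + thick)
        if n_cell_l1 + len(layer2) <= m:
            outliers.extend(members)          # every member is an outlier
            continue
        for o in members:                     # undecided: check second layer only
            near = n_cell_l1 + sum(1 for x in layer2 if math.dist(o, x) <= d)
            if near <= m:
                outliers.append(o)
    return outliers

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(cell_based_outliers(points, p=0.8, d=1.0))  # [(5.0, 5.0)]
```

In this example both cells are decided from the counts alone, so no object-to-object distance is computed. Enumerating every neighboring cell offset grows exponentially with k, which reflects the $c^k$ term in the complexity.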

Updated on: 25-Nov-2021
