What are the challenges of Outlier Detection in High-Dimensional Data?

The main challenges of outlier detection in high-dimensional data are as follows −

Interpretation of outliers − Methods should not only detect outliers but also support an interpretation of them. Because a high-dimensional data set involves many features (or dimensions), detecting outliers without explaining why they are outliers is not very useful.

The interpretation may come from the specific subspaces in which the outliers manifest, or from an assessment of the “outlierness” of the objects. Such an interpretation helps users understand the possible meaning and significance of the outliers.

Data sparsity − Methods should be capable of handling sparsity in high-dimensional spaces. As the dimensionality increases, the distance between objects becomes heavily dominated by noise, so data in high-dimensional spaces are often sparse.
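The effect of dimensionality on distances can be illustrated with a small experiment. The sketch below is only an illustration under stated assumptions: points are drawn uniformly at random from the unit hypercube, and the function name `relative_contrast` and its parameters are hypothetical, not part of any method described here. It measures how much the farthest point differs from the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for the distances from one random
    query point to n_points random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

# The contrast between nearest and farthest neighbor shrinks as
# the number of dimensions grows.
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(200, d), 3))
```

On uniform random data the printed contrast typically drops sharply as the dimensionality grows, which is one way to see why raw distances become less informative.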

Data subspaces − Methods should model outliers appropriately, for example by adapting to the subspaces in which the outliers manifest and capturing the local behavior of the data. Using a fixed distance threshold across all subspaces to detect outliers is not a good idea, because the distance between two objects monotonically increases as the dimensionality increases.
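The problem with a fixed threshold can be sketched in a few lines. The example below assumes Euclidean distance and nested subspaces made of the first d dimensions; the function `prefix_distances` and the threshold value are hypothetical and chosen only for illustration.

```python
import math
import random

def prefix_distances(a, b):
    """Euclidean distance between a and b restricted to the first d
    dimensions, for d = 1 .. len(a)."""
    total = 0.0
    out = []
    for x, y in zip(a, b):
        total += (x - y) ** 2      # each added dimension can only add distance
        out.append(math.sqrt(total))
    return out

rng = random.Random(1)
a = [rng.random() for _ in range(50)]
b = [rng.random() for _ in range(50)]
dists = prefix_distances(a, b)

threshold = 1.0  # hypothetical fixed-distance threshold
first_exceed = next((d for d, v in enumerate(dists, start=1) if v > threshold), None)
print("threshold first exceeded at dimensionality:", first_exceed)
```

Because the distance never decreases when a dimension is added, a threshold that is meaningful in a low-dimensional subspace is likely to be exceeded by ordinary pairs of objects in a higher-dimensional one.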

Scalability with respect to dimensionality − As the dimensionality increases, the number of subspaces grows exponentially. An exhaustive combinatorial exploration of the search space, which contains all possible subspaces, is not a scalable choice.
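To make the growth concrete, the sketch below counts and enumerates subspaces, assuming a subspace means a non-empty subset of the original dimensions (axis-parallel subspaces). The function names are hypothetical.

```python
from itertools import combinations

def count_subspaces(n_dims):
    """Number of non-empty subsets of n_dims dimensions: 2^n - 1."""
    return 2 ** n_dims - 1

def enumerate_subspaces(n_dims, max_size):
    """Yield every subspace (tuple of dimension indices) with at most
    max_size dimensions."""
    for size in range(1, max_size + 1):
        yield from combinations(range(n_dims), size)

for n in (5, 10, 20, 50):
    print(n, count_subspaces(n))
```

Capping the subspace size, as `enumerate_subspaces` does, is one common way to keep the search tractable, at the cost of missing outliers that only show up in larger subspaces.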

Outlier detection methods for high-dimensional data can be divided into three main approaches, as follows −

Extending Conventional Outlier Detection − One approach extends conventional outlier detection methods. It uses the conventional proximity-based models of outliers. To overcome the deterioration of proximity measures in high-dimensional spaces, it either uses alternative measures or constructs subspaces and detects outliers there.

The HilOut algorithm is an example of this approach. HilOut finds distance-based outliers, but uses the ranks of distances rather than the absolute distances. Specifically, for each object o, HilOut finds the k-nearest neighbors of o, denoted by nn1(o), ..., nnk(o), where k is an application-dependent parameter.

The weight of object o is defined as

$$w(o) = \sum_{i=1}^{k} \mathrm{dist}(o, nn_{i}(o))$$

All objects are then ranked by weight in descending order, and the top-ranked objects are reported as outliers.
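The weight w(o) can be computed directly for small data sets. The following is a minimal brute-force sketch, not the HilOut algorithm itself: HilOut approximates the neighbor search with a space-filling curve, whereas this code compares every pair of objects. The function names and the `n_top` parameter (how many top-weight objects to report) are assumptions made for illustration.

```python
import math

def knn_weights(points, k):
    """w(o) for every object: the sum of distances from o to its
    k nearest neighbors (brute force, excluding o itself)."""
    weights = []
    for i, o in enumerate(points):
        dists = sorted(math.dist(o, p) for j, p in enumerate(points) if j != i)
        weights.append(sum(dists[:k]))
    return weights

def top_outliers(points, k, n_top):
    """Indices of the n_top objects with the largest weight."""
    weights = knn_weights(points, k)
    order = sorted(range(len(points)), key=lambda i: weights[i], reverse=True)
    return order[:n_top]

# Four clustered points and one isolated point (index 4).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_outliers(points, k=2, n_top=1))
```

Because the result depends only on the ordering of the weights, no absolute distance threshold has to be chosen.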

Finding Outliers in Subspaces − Another approach is to search for outliers in various subspaces. A distinct advantage is that, if an object is found to be an outlier in a subspace of much lower dimensionality, the subspace provides critical information for interpreting why and to what extent the object is an outlier. This is highly valuable in applications with high-dimensional data because of the overwhelming number of dimensions.
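The idea that a low-dimensional subspace can explain an outlier can be sketched with a brute-force search. The code below is an illustration only, not a specific published subspace method: it assumes axis-parallel subspaces, scores objects by the sum of distances to their k nearest neighbors inside each subspace, and reports the subspaces where a chosen object has the strictly largest score. All function names and the toy data are hypothetical.

```python
import math
from itertools import combinations

def project(point, dims):
    """Restrict a point to the given dimension indices."""
    return tuple(point[d] for d in dims)

def knn_weight(target, others, k):
    """Sum of distances from target to its k nearest points in others."""
    return sum(sorted(math.dist(target, p) for p in others)[:k])

def outlying_subspaces(points, index, k, max_size):
    """Subspaces (up to max_size dimensions) in which points[index]
    has the strictly largest k-NN weight of all objects."""
    n_dims = len(points[0])
    found = []
    for size in range(1, max_size + 1):
        for dims in combinations(range(n_dims), size):
            proj = [project(p, dims) for p in points]
            weights = [knn_weight(proj[i], proj[:i] + proj[i + 1:], k)
                       for i in range(len(proj))]
            if all(weights[index] > w for i, w in enumerate(weights) if i != index):
                found.append(dims)
    return found

# Dimensions 0 and 1 are correlated for the first five objects;
# the last object breaks that pattern but looks normal in each single dimension.
points = [(0.0, 0.0, 5.0), (1.0, 1.0, 1.0), (2.0, 2.0, 4.0),
          (3.0, 3.0, 2.0), (4.0, 4.0, 3.0), (0.0, 4.0, 3.0)]
print(outlying_subspaces(points, index=5, k=2, max_size=2))
```

In this toy data the last object stands out only in the subspace formed by dimensions 0 and 1, which is exactly the kind of information that helps a user interpret why it is an outlier.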

Modeling High-Dimensional Outliers − A third approach tries to develop new models for high-dimensional outliers directly.

Updated on 18-Feb-2022 10:32:37