What is Outlier Detection?

An outlier is a data object that deviates significantly from the rest of the objects, as if it were generated by a different mechanism. For ease of presentation, data objects that are not outliers can be referred to as “normal” or expected data, and outliers as “abnormal” data.

Outliers are data objects that cannot be grouped into a given class or cluster. Their behavior differs from the usual behavior of the other data objects. Analyzing this kind of data can be important for mining knowledge.

Outliers are interesting because they are suspected of not being generated by the same mechanism as the rest of the data. Hence, in outlier detection, it is important to justify why the detected outliers are likely to have been generated by some other mechanism.

One-class classification is also known as outlier (or novelty) detection because the learning algorithm is used to differentiate between data that appears normal and data that appears abnormal with respect to the distribution of the training data.

For instance, by monitoring a social media website where new content is continually arriving, novelty detection can identify new topics and trends promptly. Novel topics may initially appear as outliers.

Outlier detection and novelty detection share some similarities in their modeling and detection approaches. A critical difference between the two is that in novelty detection, once new topics are confirmed, they are usually incorporated into the model of normal behavior so that follow-up instances are no longer treated as outliers.

A generic statistical approach to one-class classification is to identify outliers as instances that lie beyond a distance d from a given percentage p of the training data. Alternatively, a probability density can be estimated for the target class by fitting a statistical distribution, such as a Gaussian, to the training data; any test instance with a low probability value can then be marked as an outlier.
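The two statistical ideas above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a reference implementation: the function names, the synthetic training data, and the values chosen for d, p, and the density cutoff are all assumptions made for the example.

```python
import math
import random
from statistics import NormalDist

def is_distance_outlier(x, train, d, p):
    """Flag x when at least a fraction p of the training points lie farther than d from it."""
    far = sum(1 for t in train if math.dist(x, t) > d)
    return far / len(train) >= p

def fit_gaussian(values):
    """Fit a univariate Gaussian to the target class (sample mean and standard deviation)."""
    return NormalDist.from_samples(values)

def is_density_outlier(x, model, min_density):
    """Flag x when its density under the fitted Gaussian falls below min_density."""
    return model.pdf(x) < min_density

# Synthetic training data, invented purely for illustration
rng = random.Random(0)
train_2d = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
train_1d = [rng.gauss(50, 5) for _ in range(500)]

model = fit_gaussian(train_1d)
# Assumed cutoff: the density found three standard deviations from the mean
min_density = model.pdf(model.mean + 3 * model.stdev)

print(is_distance_outlier((6.0, 6.0), train_2d, d=3.0, p=0.95))  # far from the cloud
print(is_distance_outlier((0.0, 0.0), train_2d, d=3.0, p=0.95))  # centre of the cloud
print(is_density_outlier(90.0, model, min_density))              # deep in the tail
print(is_density_outlier(50.0, model, min_density))              # near the mean
```

In both variants the cutoff (d and p, or the minimum density) has to be chosen by the analyst; the values here are placeholders.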

Multiclass classifiers can be tailored to the one-class situation by fitting a boundary around the target data and deeming instances that fall outside it to be outliers. The boundary can be generated by adapting the inner workings of existing multiclass classifiers such as support vector machines.
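As one possible illustration of a boundary-based approach, the sketch below uses the OneClassSVM estimator from scikit-learn, which adapts a support vector machine to a single target class. It assumes scikit-learn is installed; the synthetic data and the parameter values are made up for the example and are not taken from the text above.

```python
import random
from sklearn.svm import OneClassSVM

rng = random.Random(0)
# Synthetic target-class data: 300 points scattered around the origin (illustrative only)
target = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]

# nu bounds the fraction of training points allowed to fall outside the boundary
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
clf.fit(target)

# predict returns +1 for points inside the learned boundary and -1 for points outside it
labels = clf.predict([[0.0, 0.0], [8.0, 8.0]]).tolist()
print(labels)
```

The first query point sits in the middle of the training data and the second lies far away from it, so they should fall on opposite sides of the fitted boundary.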

These approaches rely heavily on a parameter that determines how much of the target data is likely to be classified as outliers. If it is chosen too conservatively, data in the target class will erroneously be rejected. If it is chosen too liberally, the model will overfit and reject too much legitimate data. The rejection rate usually cannot be adjusted during testing, because an appropriate parameter value has to be chosen at training time.
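The effect of fixing this parameter at training time can be seen in a small experiment. The sketch below is a deliberately simple stand-in, not one of the methods described above: it assumes a centroid-distance boundary whose cutoff is set from a hypothetical reject_fraction parameter, then measures how much held-out target-class data that fixed boundary rejects.

```python
import math
import random

def fit_threshold(train, reject_fraction):
    """Pick the distance cutoff that rejects about reject_fraction of the training data."""
    cx = sum(x for x, _ in train) / len(train)
    cy = sum(y for _, y in train) / len(train)
    dists = sorted(math.dist((cx, cy), pt) for pt in train)
    keep = max(1, math.ceil(len(dists) * (1 - reject_fraction)))
    return (cx, cy), dists[keep - 1]

def rejection_rate(points, center, cutoff):
    """Fraction of points that fall outside the fixed boundary."""
    return sum(math.dist(center, pt) > cutoff for pt in points) / len(points)

# Synthetic data: both sets are drawn from the same target class
rng = random.Random(1)
train = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
held_out = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]

rates = []
for reject_fraction in (0.01, 0.05, 0.20):
    # The cutoff is fixed here, at training time, and not revisited afterwards
    center, cutoff = fit_threshold(train, reject_fraction)
    rate = rejection_rate(held_out, center, cutoff)
    rates.append(rate)
    print(f"reject_fraction={reject_fraction:.2f} -> held-out target rejected: {rate:.3f}")
```

Because the cutoff is computed once from the training data, changing the rejection rate means refitting; the larger the chosen fraction, the more legitimate held-out data ends up outside the boundary.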

Updated on: 10-Feb-2022

