How does discordancy testing work?


A statistical discordancy test examines two hypotheses: a working hypothesis and an alternative hypothesis. The working hypothesis, H, states that the entire data set of n objects comes from an initial distribution model, F, i.e., H: o_i ∈ F, where i = 1, 2, ..., n.

The working hypothesis is retained if there is no statistically significant evidence supporting its rejection. A discordancy test checks whether an object o_i is significantly large (or small) in relation to the distribution F. Different test statistics have been proposed for use as a discordancy test, depending on the available knowledge of the data.

Suppose that some statistic T has been selected for discordancy testing, and that the value of the statistic for object o_i is v_i. The distribution of T is then constructed, and the significance probability SP(v_i) = Prob(T > v_i) is evaluated.

If some SP(v_i) is sufficiently small, then o_i is discordant and the working hypothesis is rejected. An alternative hypothesis, which states that o_i comes from another distribution model, G, is adopted. The result depends heavily on which model F is chosen, because o_i can be an outlier under one model and a perfectly valid value under another.
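The procedure above can be sketched in a few lines. This is only an illustration under stated assumptions: F is taken to be a normal distribution with known parameters, the statistic T is simply the object's own value, and only the upper tail (unusually large values) is checked. The function names and the threshold `alpha` are hypothetical and do not come from any particular library.

```python
import math

def significance_probability(v, mu, sigma):
    """SP(v) = Prob(T > v), assuming T follows Normal(mu, sigma) under F."""
    z = (v - mu) / sigma
    # Upper-tail probability of the standard normal, via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2))

def discordant_objects(values, mu, sigma, alpha=0.01):
    """Return (value, SP) pairs whose significance probability falls below alpha."""
    flagged = []
    for v in values:
        sp = significance_probability(v, mu, sigma)
        if sp < alpha:  # "sufficiently small" is an assumed cut-off
            flagged.append((v, sp))
    return flagged

# Illustrative data: five ordinary readings and one suspiciously large one.
data = [9.8, 10.1, 10.4, 9.6, 10.0, 14.2]
flagged = discordant_objects(data, mu=10.0, sigma=0.5)
print(flagged)
```

Changing the assumed F (for example, a larger `sigma`) changes which objects are flagged, which is the model dependence described above.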

The alternative distribution is essential in determining the power of the test, i.e., the probability that the working hypothesis is rejected when o_i is really an outlier. There are several types of alternative distributions.
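To make the notion of power concrete, the sketch below computes it for a simple case. It assumes that both F and G are normal distributions, that the test rejects when the upper-tail significance probability is below `alpha`, and that the statistic is the object's value itself. The function name `power_against` is hypothetical; `statistics.NormalDist` is part of the Python standard library (3.8 and later).

```python
from statistics import NormalDist

def power_against(f, g, alpha=0.01):
    """Probability that an object drawn from G is declared discordant under F."""
    # Under F, SP(v) < alpha is equivalent to v exceeding this critical value.
    critical = f.inv_cdf(1 - alpha)
    # Power is the chance that a draw from G lands beyond the critical value.
    return 1 - g.cdf(critical)

f = NormalDist(mu=10.0, sigma=0.5)
far_power = power_against(f, NormalDist(mu=12.0, sigma=0.5))
near_power = power_against(f, NormalDist(mu=10.5, sigma=0.5))
print(far_power, near_power)
```

The further G sits from F, the higher the power. When G equals F, the value returned is just the false-alarm rate `alpha`.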

Inherent alternative distribution − In this case, the working hypothesis that all of the objects come from distribution F is rejected in favor of the alternative hypothesis that all of the objects arise from another distribution, G −

H: o_i ∈ G, where i = 1, 2, ..., n

F and G can be different distributions or differ only in the parameters of the same distribution. The form of G is constrained in that it must have the potential to produce outliers. For example, it can have a different mean or dispersion, or a longer tail.

Mixture alternative distribution − The mixture alternative states that discordant values are not outliers in the F population, but contaminants from some other population, G. In this case, the alternative hypothesis is −

H: o_i ∈ (1 − λ)F + λG, where i = 1, 2, ..., n
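One way to picture the mixture alternative is to simulate it: each object is drawn from G with probability λ and from F otherwise. The sketch below assumes normal distributions for both F and G with made-up parameters. The function name `sample_mixture` and the source labels are hypothetical and used only for illustration.

```python
import random

def sample_mixture(n, lam, draw_f, draw_g, rng):
    """Draw n objects from (1 - lam) * F + lam * G, recording each object's source."""
    samples = []
    for _ in range(n):
        if rng.random() < lam:
            samples.append((draw_g(rng), "G"))  # contaminant
        else:
            samples.append((draw_f(rng), "F"))  # regular object
    return samples

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = sample_mixture(
    n=1000,
    lam=0.05,                              # assumed contamination weight
    draw_f=lambda r: r.gauss(10.0, 0.5),   # assumed F
    draw_g=lambda r: r.gauss(15.0, 0.5),   # assumed G
    rng=rng,
)
contaminants = [v for v, src in draws if src == "G"]
print(len(contaminants), "of", len(draws), "objects came from G")
```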

Slippage alternative distribution − This alternative states that all of the objects (apart from some prescribed small number) arise independently from the original model F with parameters μ and σ², while the remaining objects are independent observations from a modified version of F in which the parameters have been changed.
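The slippage alternative can be simulated in a similar way. The sketch below assumes F is normal and that only the mean "slips" by a fixed amount for a prescribed number k of objects; a change in variance would be an equally valid modification. The function name `sample_slippage` and all parameter values are hypothetical.

```python
import random

def sample_slippage(n, k, mu, sigma, shift, rng):
    """Draw n - k objects from Normal(mu, sigma) and k from Normal(mu + shift, sigma)."""
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    regular = [rng.gauss(mu, sigma) for _ in range(n - k)]
    slipped = [rng.gauss(mu + shift, sigma) for _ in range(k)]
    return regular, slipped

rng = random.Random(0)  # fixed seed so the run is repeatable
regular, slipped = sample_slippage(n=100, k=3, mu=10.0, sigma=0.5, shift=4.0, rng=rng)
print(len(regular), "regular objects,", len(slipped), "slipped objects")
```

Unlike the mixture case, the number of modified objects here is fixed in advance instead of being random.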

Published on 24-Nov-2021 06:38:13