What are the approaches to Unsupervised Discretization?


An attribute is discrete if it has a relatively small (finite) number of possible values, while a continuous attribute is treated as having a very large (infinite) number of possible values.

In other words, a discrete attribute can be viewed as a function whose range is a finite set, while a continuous attribute is a function whose range is an infinite, totally ordered set, usually an interval.

Discretization aims to reduce the number of possible values a continuous attribute takes by partitioning its range into a number of intervals. There are two approaches to the problem of discretization. One is to quantize each attribute without any knowledge of the classes of the instances in the training set; this is called unsupervised discretization.

The other is to take the classes into account when discretizing; this is called supervised discretization. The former is the only possibility when dealing with clustering problems, where the classes are unknown or non-existent.

The obvious way of discretizing a numeric attribute is to divide its range into a predetermined number of equal intervals: a fixed, data-independent yardstick. This method is known as equal-width (or equal-interval) binning, and it is often applied at the time the data is collected.
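As a minimal sketch of dividing a range into equal intervals, the following Python snippet maps each value to the index of the interval it falls in. The function name `equal_width_bins` and the sample values are illustrative assumptions, not part of any library.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything lands in the first bin.
        return [0] * len(values)
    width = (hi - lo) / k
    # The maximum value would compute to index k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


# Hypothetical sample: seven small values and one large outlier.
values = [1, 2, 3, 4, 5, 6, 7, 100]
bins = equal_width_bins(values, 4)
print(bins)  # [0, 0, 0, 0, 0, 0, 0, 3]
```

With this sample, the interval boundaries depend only on the minimum and maximum, so a single outlier pushes almost every instance into the first bin and leaves the middle bins empty.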

Like any unsupervised discretization method, equal-width binning runs the risk of destroying distinctions that would have turned out to be useful in the learning process, either by using gradations that are too coarse or by an unfortunate choice of boundaries that needlessly lumps together many instances of different classes.

Equal-width binning often distributes instances very unevenly: some bins contain many instances while others contain none. This can seriously impair the ability of the attribute to help build good decision structures. It is often better to allow the intervals to be of different sizes, choosing them so that roughly the same number of training examples falls into each one.

This method, known as equal-frequency binning, divides the attribute’s range into a predetermined number of bins based on the distribution of instances along that axis. It is sometimes called histogram equalization, because a histogram of the contents of the resulting bins is flat, or nearly so. If the number of bins is viewed as a resource, this method makes the best use of it.
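One way to sketch equal-frequency binning in Python is to rank the values and cut the ranking into equally sized groups. The function name `equal_frequency_bins` and the sample values are illustrative assumptions; as a simplification, this version may place tied values in different bins.

```python
def equal_frequency_bins(values, k):
    """Assign each value a bin index so that bins hold (nearly) equal counts."""
    n = len(values)
    # Positions of the values, listed in ascending order of value.
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        # The value with sorted rank r goes to bin floor(r * k / n).
        bins[i] = rank * k // n
    return bins


# Hypothetical sample: seven small values and one large outlier.
values = [1, 2, 3, 4, 5, 6, 7, 100]
bins = equal_frequency_bins(values, 4)
counts = [bins.count(b) for b in range(4)]
print(bins)    # [0, 0, 1, 1, 2, 2, 3, 3]
print(counts)  # [2, 2, 2, 2]
```

Because the cut points follow the ranks rather than the raw values, every bin receives two instances here and the outlier no longer distorts the split.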

However, equal-frequency binning is still oblivious to the instances’ classes, and this can produce bad boundaries. For example, if all instances in a bin have one class, and all instances in the next higher bin have another class except for the first, which has the original class, it surely makes sense to respect the class divisions and include that first instance in the previous bin, sacrificing the equal-frequency property for the sake of homogeneity.
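The situation described above can be illustrated with a small hand-made example in Python. The data, the class labels `"a"` and `"b"`, and the helper `class_counts` are all hypothetical, and the shifted boundary is chosen by hand rather than by any supervised discretization algorithm.

```python
from collections import Counter

# Hypothetical sorted attribute values with their class labels.
values = [1, 2, 3, 4, 5, 6, 7, 8]
classes = ["a", "a", "a", "a", "a", "b", "b", "b"]


def class_counts(bins, classes, k):
    """Count the class labels that fall inside each of the k bins."""
    counts = [Counter() for _ in range(k)]
    for b, c in zip(bins, classes):
        counts[b][c] += 1
    return [dict(c) for c in counts]


# Equal-frequency split into two bins of four instances each.
equal_freq = [0, 0, 0, 0, 1, 1, 1, 1]
# Boundary moved by one instance so that it follows the class change.
class_aware = [0, 0, 0, 0, 0, 1, 1, 1]

print(class_counts(equal_freq, classes, 2))   # [{'a': 4}, {'a': 1, 'b': 3}]
print(class_counts(class_aware, classes, 2))  # [{'a': 5}, {'b': 3}]
```

The equal-frequency split leaves one stray `"a"` instance in the second bin, while moving the boundary by a single instance yields two pure bins of unequal size.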

Updated on 10-Feb-2022 11:54:18