What is Data Discretization?


Data discretization techniques reduce the number of values of a continuous attribute by dividing the attribute's range into intervals. Interval labels can then be used to replace the actual data values. Replacing the many values of a continuous attribute with a small number of interval labels reduces and simplifies the original data.
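As an illustration of replacing values with interval labels, here is a minimal sketch of equal-width binning. The function name `equal_width_bins`, the sample `ages` list, and the choice of three bins are assumptions made for this example. Equal-width binning is only one of several ways to choose the intervals.

```python
def equal_width_bins(values, k):
    """Split the range of `values` into k equal-width intervals.

    Returns (edges, labels), where labels[i] is the index of the
    interval that values[i] falls into.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    edges = [lo + i * width for i in range(k + 1)]
    labels = []
    for v in values:
        if width == 0:
            # All values are identical, so everything goes in bin 0.
            labels.append(0)
        else:
            # The maximum value would compute to index k, so clamp it
            # into the last bin (index k - 1).
            labels.append(min(int((v - lo) / width), k - 1))
    return edges, labels


# Hypothetical sample data: 14 ages reduced to 3 interval labels.
ages = [13, 15, 16, 19, 20, 21, 25, 30, 33, 35, 40, 45, 52, 70]
edges, labels = equal_width_bins(ages, 3)
```

Here the 14 distinct ages are replaced by just three labels (0, 1, and 2), one for each interval bounded by `edges`.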

This leads to a concise, easy-to-use, knowledge-level representation of mining results. Discretization techniques can be categorized based on how the discretization is performed, such as whether it uses class information or which direction it proceeds in (i.e., top-down vs. bottom-up). If the discretization process uses class information, it is called supervised discretization. Otherwise, it is unsupervised.

If the process starts by finding one or a few points (known as split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals, it is known as top-down discretization or splitting.
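The top-down procedure can be sketched as a small recursive function. This example assumes a supervised setting and uses class entropy to choose each split point. The text does not name a criterion, so entropy is an assumption here, as are the function names, the `depth` limit, and the sample data.

```python
from collections import Counter
from math import log2


def entropy(classes):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(classes)
    return -sum((c / n) * log2(c / n) for c in Counter(classes).values())


def best_split(pairs):
    """Return the cut point with the lowest weighted class entropy.

    `pairs` is a list of (value, class) tuples sorted by value.
    Returns None if no cut point exists (all values are equal).
    """
    best = None
    n = len(pairs)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between two equal values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or score < best[1]:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cut, score)
    return best


def split_points(pairs, depth):
    """Recursively split the range, top-down, up to `depth` levels."""
    if depth == 0 or len({c for _, c in pairs}) < 2:
        return []  # depth limit reached, or interval is already pure
    found = best_split(pairs)
    if found is None:
        return []
    cut = found[0]
    left = [p for p in pairs if p[0] < cut]
    right = [p for p in pairs if p[0] >= cut]
    return split_points(left, depth - 1) + [cut] + split_points(right, depth - 1)


# Hypothetical (value, class) samples, already sorted by value.
samples = [(1, "a"), (2, "a"), (3, "a"), (10, "b"),
           (11, "b"), (12, "b"), (20, "a"), (21, "a")]
cuts = split_points(samples, depth=2)
```

The first call splits the whole range once, and each recursive call repeats the same search on one of the resulting intervals, which is what makes the method top-down.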

Bottom-up discretization, or merging, starts by considering all of the continuous values as potential split points, removes some by merging neighboring values to form intervals, and then recursively applies this process to the resulting intervals. Discretization can be performed recursively on an attribute to provide a hierarchical or multi-resolution partitioning of the attribute values, known as a concept hierarchy.
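A minimal sketch of the bottom-up direction follows. It starts with every distinct value as its own interval and repeatedly merges the two neighboring intervals separated by the smallest gap until `k` intervals remain. The smallest-gap rule, the stopping condition, and the function name `merge_intervals` are assumptions chosen for simplicity. Practical merging methods typically use a statistical test on class information instead.

```python
def merge_intervals(values, k):
    """Bottom-up discretization by merging the closest neighbors.

    Returns a list of (low, high) intervals, at most k of them.
    """
    # Start with one single-point interval per distinct value.
    intervals = [[v, v] for v in sorted(set(values))]
    while len(intervals) > k:
        # Gap between each interval and its right-hand neighbor.
        gaps = [intervals[i + 1][0] - intervals[i][1]
                for i in range(len(intervals) - 1)]
        i = gaps.index(min(gaps))  # first pair with the smallest gap
        # Merge interval i with interval i + 1.
        intervals[i] = [intervals[i][0], intervals[i + 1][1]]
        del intervals[i + 1]
    return [tuple(iv) for iv in intervals]


# Hypothetical sample values.
merged = merge_intervals([1, 2, 3, 10, 11, 12, 20, 21], k=3)
```

The loop is written iteratively rather than recursively, but each pass applies the same merge step to the intervals produced by the previous pass.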

Concept hierarchies are useful for mining at multiple levels of abstraction. A concept hierarchy for a given numerical attribute defines a discretization of the attribute. Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numerical values for the attribute age) with higher-level concepts (such as youth, middle-aged, or senior). Although detail is lost by such data generalization, the generalized data can be more meaningful and easier to interpret.
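The age example can be sketched as a small three-level hierarchy: raw age, ten-year band, and named group. The cut-offs used below (under 30 for youth, under 60 for middle-aged) and the intermediate ten-year level are assumptions for illustration. The text does not specify where the boundaries lie.

```python
def age_band(age):
    """Level 1: map an age to its ten-year band, e.g. 34 -> '30-39'."""
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"


def age_group(age):
    """Level 2: map an age to a named group (cut-offs are assumed)."""
    if age < 30:
        return "youth"
    if age < 60:
        return "middle-aged"
    return "senior"


def generalize(age, level):
    """Return the value of `age` at the requested hierarchy level.

    Level 0 is the raw value, level 1 the ten-year band,
    level 2 the named group.
    """
    if level == 0:
        return age
    if level == 1:
        return age_band(age)
    if level == 2:
        return age_group(age)
    raise ValueError("level must be 0, 1, or 2")


# Generalize a hypothetical list of ages to the highest level.
groups = [generalize(a, 2) for a in [19, 34, 72]]
```

Choosing a higher level hides more detail but leaves fewer distinct values to mine.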

Generalization also contributes to a consistent representation of mining results across multiple mining tasks, which is a common requirement. In addition, mining on a reduced data set requires fewer input/output operations and is more efficient than mining on a larger, ungeneralized data set. Because of these advantages, discretization techniques and concept hierarchies are generally applied before data mining as a preprocessing step, rather than during mining.

Several discretization methods can be used to automatically generate or dynamically refine concept hierarchies for numerical attributes. In addition, many hierarchies for categorical attributes are implicit in the database schema and can be defined automatically at the schema definition level.

Updated on 19-Nov-2021 12:19:05