What are the techniques of Discretization and Concept Hierarchy Generation for Numerical Data?


It is complex and laborious to define concept hierarchies for numerical attributes because of the wide diversity of possible data ranges and the frequent updates of data values. The methods of concept hierarchy generation for numeric data are as follows −

Binning − Binning is a top-down splitting technique based on a specified number of bins. Binning methods are also used as discretization methods for numerosity reduction and concept hierarchy generation. They can be applied recursively to the resulting partitions to generate concept hierarchies. Binning does not use class information and is, therefore, an unsupervised discretization technique. It is sensitive to the user-specified number of bins and to the presence of outliers.
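As a minimal sketch of equal-width binning, the example below computes bin edges and assigns each value to a bin. The function names `equal_width_edges` and `assign_bin` and the sample prices are illustrative assumptions, not part of any particular library. The last lines show how a single outlier stretches the bins.

```python
def equal_width_edges(values, k):
    """Return k + 1 edges that split [min, max] into k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]


def assign_bin(value, edges):
    """Return the index of the interval that contains value.

    Intervals are half-open [edge_i, edge_i+1); the maximum value
    is placed in the last bin.
    """
    k = len(edges) - 1
    for i in range(k):
        if value < edges[i + 1]:
            return i
    return k - 1


# Hypothetical sample data (sorted prices).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

edges = equal_width_edges(prices, 3)
bins = [assign_bin(p, edges) for p in prices]

# One outlier widens every bin, so all original values collapse into bin 0.
edges_outlier = equal_width_edges(prices + [1000], 3)
bins_outlier = [assign_bin(p, edges_outlier) for p in prices]
```

Here `edges` is `[4.0, 14.0, 24.0, 34.0]`, while with the outlier the first bin alone spans 4 to 336, which illustrates the sensitivity to outliers noted above.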

Histogram Analysis − Like binning, histogram analysis is an unsupervised discretization technique because it does not use class information. Histograms partition the values of an attribute, A, into disjoint ranges known as buckets. In an equal-width histogram, for instance, the values are partitioned into equal-sized partitions or ranges (for example, for the attribute price, where each bucket has a width of $10). With an equal-frequency histogram, the values are partitioned so that, ideally, each partition contains the same number of data tuples.

The histogram analysis algorithm can be applied recursively to each partition to automatically generate a multilevel concept hierarchy, with the procedure terminating once a pre-specified number of concept levels has been reached.

A minimum interval size can also be used per level to control the recursive procedure. This specifies the minimum width of a partition or the minimum number of values for each partition at each level.
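The recursive procedure can be sketched as follows, assuming equal-width buckets. The function `build_hierarchy` and its parameters (`fanout`, `levels`, `min_width`) are hypothetical names chosen for this illustration: `levels` is the number of concept levels to generate below the root, and `min_width` is the minimum bucket width allowed.

```python
def build_hierarchy(values, lo, hi, fanout=2, levels=2, min_width=1.0):
    """Recursively split [lo, hi) into `fanout` equal-width buckets.

    Recursion stops when `levels` reaches 0 or when the child buckets
    would be narrower than `min_width`.
    """
    inside = [v for v in values if lo <= v < hi]
    node = {"range": (lo, hi), "count": len(inside), "children": []}
    width = (hi - lo) / fanout
    if levels == 0 or width < min_width:
        return node
    for i in range(fanout):
        child_lo = lo + i * width
        child_hi = lo + (i + 1) * width
        node["children"].append(
            build_hierarchy(inside, child_lo, child_hi, fanout, levels - 1, min_width)
        )
    return node


def leaves(node):
    """Collect the ranges of the lowest-level buckets, left to right."""
    if not node["children"]:
        return [node["range"]]
    result = []
    for child in node["children"]:
        result.extend(leaves(child))
    return result


# Hypothetical sample data.
prices = [5, 12, 18, 27, 33, 41, 58, 64, 79, 95]

tree = build_hierarchy(prices, 0, 100, fanout=2, levels=2, min_width=10)
shallow = build_hierarchy(prices, 0, 100, fanout=2, levels=2, min_width=30)
```

With `min_width=10` the hierarchy reaches four leaf buckets of width 25. With `min_width=30` the recursion stops after the first split, because the next level would produce buckets of width 25.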

Entropy-Based Discretization − Entropy is one of the most commonly used discretization measures. It was first introduced by Claude Shannon in his pioneering work on information theory and the concept of information gain.

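Entropy-based discretization is a supervised, top-down splitting technique. It uses class distribution information in its computation and determination of split points (data values for partitioning an attribute range). To discretize a numerical attribute, A, the method selects the value of A that gives the minimum entropy as a split point, and recursively partitions the resulting intervals to arrive at a hierarchical discretization.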
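A single splitting step can be sketched as below. This is an illustration under stated assumptions: candidate split points are taken as midpoints between adjacent distinct values, and no stopping criterion is shown. The names `entropy` and `best_split` and the sample data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (split_point, weighted_entropy) for the best binary split.

    Each candidate splits the sorted data into a left and a right part;
    the candidate with the lowest size-weighted class entropy wins.
    Returns None if all values are identical.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        split = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        info = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if best is None or info < best[1]:
            best = (split, info)
    return best


# Hypothetical sample: attribute values with their class labels.
values = [1, 2, 3, 10, 11, 12]
labels = ["low", "low", "low", "high", "high", "high"]

split_point, info = best_split(values, labels)
```

For this sample the chosen split point is 6.5, which separates the two classes perfectly, so the weighted entropy of the split is zero. Applying `best_split` again to each side would give the next level of the hierarchy.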

Cluster Analysis − Cluster analysis is a popular data discretization method. A clustering algorithm can be applied to discretize a numerical attribute, A, by partitioning the values of A into clusters or groups.

Clustering considers the distribution of A, as well as the closeness of data points, and therefore can produce high-quality discretization results. Clustering can be used to generate a concept hierarchy for A by following either a top-down splitting strategy or a bottom-up merging strategy, where each cluster forms a node of the concept hierarchy.
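As one possible illustration, the sketch below discretizes a numerical attribute with a simple one-dimensional k-means and then derives interval boundaries between adjacent clusters. The source does not prescribe k-means; it is used here only as an example of a clustering algorithm. The names `kmeans_1d` and `cut_points`, the initialization scheme, and the sample data are assumptions.

```python
def kmeans_1d(values, k, iterations=100):
    """Group one-dimensional values into k clusters with Lloyd's algorithm.

    Initial centroids are picked at evenly spaced positions in the
    sorted data, which keeps the result deterministic.
    """
    data = sorted(values)
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        updated = [
            sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return clusters


def cut_points(clusters):
    """Boundaries halfway between neighbouring non-empty clusters."""
    ordered = sorted((c for c in clusters if c), key=min)
    return [(max(a) + min(b)) / 2 for a, b in zip(ordered, ordered[1:])]


# Hypothetical sample data with three natural groups.
ages = [1, 2, 3, 20, 21, 22, 50, 51, 52]

clusters = kmeans_1d(ages, 3)
boundaries = cut_points(clusters)
```

Each cluster becomes one interval of the discretized attribute, with `boundaries` equal to `[11.5, 36.0]` for this sample. Merging neighbouring clusters (bottom-up) or re-clustering within each cluster (top-down) would add further levels to the hierarchy.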

Updated on 19-Nov-2021 12:20:34