# What is Entropy-Based Discretization?


Entropy-based discretization is a supervised, top-down splitting technique. It uses class distribution information when computing and selecting split-points (attribute values that separate an attribute's range into intervals). To discretize a numeric attribute A, the method chooses the value of A that has the minimum entropy as a split-point, and recursively partitions the resulting intervals to arrive at a hierarchical discretization.

Such discretization forms a concept hierarchy for A. Let D consist of data tuples described by a set of attributes and a class-label attribute; the class-label attribute provides the class information for each tuple. The basic method for entropy-based discretization of an attribute A within the set is as follows −

Each value of A can be treated as a potential interval boundary or split-point (denoted split_point) for partitioning the range of A. That is, a split-point for A can divide the tuples in D into two subsets satisfying the conditions A ≤ split_point and A > split_point, respectively, thereby creating a binary discretization.
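The binary partition described above can be sketched as follows; the tuple representation and names are illustrative assumptions, not part of a specific library.

```python
# Minimal sketch: partition tuples D on attribute A at a candidate split-point.
# Each tuple is represented as an (A_value, class_label) pair (assumed layout).

def binary_split(data, split_point):
    """Divide tuples into D1 (A <= split_point) and D2 (A > split_point)."""
    d1 = [t for t in data if t[0] <= split_point]
    d2 = [t for t in data if t[0] > split_point]
    return d1, d2

data = [(1.0, 'C1'), (2.5, 'C1'), (3.0, 'C2'), (4.5, 'C2')]
d1, d2 = binary_split(data, 2.5)  # → D1 holds the two C1 tuples, D2 the two C2 tuples
```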

Entropy-based discretization uses information about the class labels of the tuples. To understand the intuition behind it, consider classification. Suppose we want to classify the tuples in D by partitioning on attribute A at some split-point.

For example, if we had two classes, we would hope that all tuples of, say, class C1 fall into one partition and all tuples of class C2 fall into the other. But this is unlikely in practice; for instance, the first partition may contain many tuples of C1 but also some of C2. The amount of information still needed to classify a tuple after such a split is known as the expected information requirement for classifying a tuple in D based on partitioning by A. It is given by

$$\mathrm{Info_A(D)\:=\:\frac{\mid\:D_1\:\mid}{\mid\:D\:\mid}Entropy(D_1)\:+\:\frac{\mid\:D_2\:\mid}{\mid\:D\:\mid}Entropy(D_2)}$$

where D1 and D2 correspond to the tuples in D satisfying the conditions A ≤ split_point and A > split_point, respectively, and |D| is the number of tuples in D, and so on. The entropy function for a given set is computed based on the class distribution of the tuples in the set.

For instance, given m classes, C1, C2, ..., Cm, the entropy of D1 is

$$\mathrm{Entropy(D_1)}\:=\:-\displaystyle\sum\limits_{i=1}^m P_i{\log_{2}(P_i)}$$

where Pi is the probability of class Ci in D1, computed as the proportion of tuples of class Ci in D1.
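The two formulas can be computed directly. The sketch below assumes tuples are (A_value, class_label) pairs; the function names `entropy` and `info_a` are illustrative, not from any particular library.

```python
import math

def entropy(tuples):
    """Entropy of a tuple set: -sum_i P_i * log2(P_i) over its class distribution."""
    n = len(tuples)
    if n == 0:
        return 0.0
    counts = {}
    for _, label in tuples:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_a(data, split_point):
    """Expected information requirement Info_A(D) for a binary split of D."""
    d1 = [t for t in data if t[0] <= split_point]
    d2 = [t for t in data if t[0] > split_point]
    n = len(data)
    return len(d1) / n * entropy(d1) + len(d2) / n * entropy(d2)

data = [(1.0, 'C1'), (2.0, 'C1'), (3.0, 'C2'), (4.0, 'C2')]
info_a(data, 2.0)  # → 0.0: both partitions are pure, so no information is needed
```

A pure partition (all tuples of one class) has entropy 0, so a split-point that separates the classes perfectly yields the minimum possible Info_A(D).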

The process of determining a split-point is recursively applied to each partition obtained, until some stopping criterion is met, such as when the minimum information requirement over all candidate split-points is less than a small threshold, ε, or when the number of intervals is greater than a threshold, max_interval.
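The full recursive procedure can be sketched as below. The thresholds `EPSILON` and `MAX_INTERVALS`, the stopping test, and all function names are illustrative assumptions; real implementations tune these choices.

```python
import math

EPSILON = 0.1        # minimum required entropy reduction (assumed value)
MAX_INTERVALS = 8    # cap on the number of intervals (assumed value)

def entropy(tuples):
    """-sum_i P_i * log2(P_i) over the class distribution of `tuples`."""
    n = len(tuples)
    if n == 0:
        return 0.0
    counts = {}
    for _, label in tuples:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_split(data):
    """Return (split_point, Info_A) minimizing the expected information."""
    n = len(data)
    best = (None, float('inf'))
    for value in sorted({v for v, _ in data}):
        d1 = [t for t in data if t[0] <= value]
        d2 = [t for t in data if t[0] > value]
        if not d1 or not d2:
            continue  # degenerate split: all tuples on one side
        info = len(d1) / n * entropy(d1) + len(d2) / n * entropy(d2)
        if info < best[1]:
            best = (value, info)
    return best

def discretize(data, splits=None):
    """Recursively collect split-points until a stopping criterion is met."""
    if splits is None:
        splits = []
    split, info = best_split(data)
    # Stop when no valid split exists, the entropy reduction falls below
    # EPSILON, or we already have MAX_INTERVALS intervals (len(splits) + 1).
    if split is None or entropy(data) - info < EPSILON \
            or len(splits) + 1 >= MAX_INTERVALS:
        return sorted(splits)
    splits.append(split)
    discretize([t for t in data if t[0] <= split], splits)
    discretize([t for t in data if t[0] > split], splits)
    return sorted(splits)
```

On the four-tuple example used earlier, `discretize([(1.0, 'C1'), (2.0, 'C1'), (3.0, 'C2'), (4.0, 'C2')])` returns the single split-point `[2.0]`, since both resulting intervals are then pure and further splitting gains no entropy reduction.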