What are Attribute Selection Measures?

An attribute selection measure is a heuristic for choosing the splitting criterion that “best” separates a given data partition, D, of class-labeled training tuples into individual classes.

If D is split into smaller partitions according to the outcomes of the splitting criterion, ideally every partition would be pure (i.e., all tuples that fall into a given partition would belong to the same class).

Conceptually, the “best” splitting criterion is the one that most closely produces such a result. Attribute selection measures are also called splitting rules because they determine how the tuples at a given node are to be divided.

The attribute selection measure provides a ranking for each attribute describing the given training tuples. The attribute with the best score for the measure is selected as the splitting attribute for the given tuples.

If the splitting attribute is continuous-valued or if the tree is restricted to binary splits, then, respectively, either a split point or a splitting subset must also be determined as part of the splitting criterion.
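
As a rough illustration of what choosing a split point involves, the sketch below lists candidate split points for a continuous-valued attribute. It assumes one common convention, the midpoint between each pair of adjacent distinct values; the function name `candidate_split_points` and the sample ages are hypothetical and not part of the original text.

```python
def candidate_split_points(values):
    """Return midpoints between adjacent distinct values (an assumed convention)."""
    distinct = sorted(set(values))
    # Pair each value with its successor and take the midpoint of the pair.
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


# Hypothetical ages observed at a node.
ages = [25, 40, 30, 30]
points = candidate_split_points(ages)
```

Each candidate would then be scored with the chosen attribute selection measure, and the best-scoring point kept as the split point.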

The tree node created for partition D is labeled with the splitting criterion, a branch is grown for each outcome of the criterion, and the tuples are partitioned accordingly. Three popular attribute selection measures are information gain, gain ratio, and Gini index.

Information gain − Information gain is used to decide which features/attributes provide the most information about a class. It is based on entropy and aims to reduce the level of entropy from the root node down to the leaf nodes.

Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is selected as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or “impurity” in these partitions.
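
The selection step can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual entropy-based definition, Gain(A) = Info(D) − Info_A(D), where Info_A(D) is the size-weighted entropy of the partitions produced by attribute A. The helper names and the toy data set are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Info(D): expected bits needed to identify the class of a tuple."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(rows, labels, attr):
    """Info(D) minus the weighted entropy of the partitions induced by attr."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    total = len(labels)
    info_after = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - info_after


# Hypothetical toy partition D at node N.
rows = [
    {"student": "yes", "income": "high"},
    {"student": "yes", "income": "low"},
    {"student": "no", "income": "high"},
    {"student": "no", "income": "low"},
]
labels = ["buys", "buys", "skips", "skips"]

gains = {attr: info_gain(rows, labels, attr) for attr in ("student", "income")}
best = max(gains, key=gains.get)  # attribute with the highest gain
```

In this toy data, splitting on `student` yields pure partitions, so it receives the full gain, while `income` leaves the classes evenly mixed and gains nothing.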

Gain ratio − The information gain measure is biased toward tests with many outcomes. That is, it tends to select attributes having a large number of values. For instance, consider an attribute that acts as a unique identifier, such as product ID.

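A split on product ID would result in a large number of partitions, each one containing only one tuple. Because each partition is pure, the information required to classify data set D based on this partitioning would be Info_product_ID(D) = 0. The information gained by splitting on this attribute is therefore maximal, even though such a partitioning is useless for classification. The gain ratio compensates for this bias by normalizing the information gain by the split information, which grows with the number of partitions.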
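
The effect can be checked numerically. The sketch below assumes the C4.5-style definition GainRatio(A) = Gain(A) / SplitInfo_A(D), where SplitInfo_A(D) is the entropy of the partition sizes; the function names and the four-tuple data set are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Entropy (in bits) of the value distribution in a list."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(attr_values, labels):
    """Information gain divided by split information (assumed C4.5-style)."""
    total = len(labels)
    partitions = {}
    for value, label in zip(attr_values, labels):
        partitions.setdefault(value, []).append(label)
    info_after = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy(labels) - info_after
    # Split information is the entropy of the attribute's own value distribution.
    split_info = entropy(attr_values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical data: product_id is unique per tuple, student has two values.
labels = ["yes", "yes", "no", "no"]
product_id = ["p1", "p2", "p3", "p4"]
student = ["y", "y", "n", "n"]

ratio_id = gain_ratio(product_id, labels)
ratio_student = gain_ratio(student, labels)
```

Both attributes have the same information gain here, but the unique identifier is penalized by its larger split information, so `student` ends up with the higher gain ratio.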

Gini index − The Gini index is used in CART. It measures the impurity of D, a data partition or set of training tuples, as

$$\mathrm{Gini}(D) = 1 - \displaystyle\sum\limits_{i=1}^{m} p_i^2$$

where p_i is the probability that a tuple in D belongs to class C_i, estimated by |C_i,D|/|D|, and the sum is computed over the m classes.
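
The formula translates directly into code. The following is a minimal sketch in which each p_i is estimated from class counts; the function name `gini` and the sample labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini(D) = 1 - sum of squared class probabilities in D."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


pure = gini(["a", "a", "a", "a"])   # a single class
mixed = gini(["a", "a", "b", "b"])  # two equally likely classes
```

A pure partition has a Gini index of 0, and the value rises as the classes become more evenly mixed.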

raja
Updated on 16-Feb-2022 11:46:57
