What is the C5 Pruning Algorithm?


C5 is the current version of the decision-tree algorithm that the Australian researcher J. Ross Quinlan has been developing and refining for many years. An earlier version, ID3, published in 1986, was influential in the field of machine learning, and its successors are used in several commercial data mining products.

The trees grown by C5 are similar to those grown by CART. Like CART, the C5 algorithm first grows an overfit tree and then prunes it back to create a more stable model. The pruning strategy is different, however: C5 does not use a validation set to choose among candidate subtrees.

The same data used to grow the tree is also used to decide how the tree should be pruned. This may reflect the algorithm’s origins in the academic world, where in the past university researchers had a hard time getting their hands on substantial quantities of real data to use for training sets. Consequently, they spent much time and effort trying to coax the last few drops of information from their limited datasets, a problem that data miners in the business world do not face.

C5 prunes the tree by examining the error rate at each node and assuming that the true error rate is substantially worse. If N records arrive at a node, and E of them are classified incorrectly, then the error rate at that node is E/N.
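The observed error rate described above can be illustrated with a short Python sketch. It assumes that a node predicts the majority class of the training records that reach it, so E is the number of records outside that class; the function name `node_error_rate` and the sample labels are hypothetical and are not part of C5 itself.

```python
from collections import Counter


def node_error_rate(labels):
    """Return (E, N, E/N) for the training labels that reach one node.

    Assumption: the node predicts its majority class, so every record
    carrying any other label counts as an error.
    """
    if not labels:
        raise ValueError("a node must contain at least one record")
    n = len(labels)
    # Count of the most frequent label at this node.
    majority_count = Counter(labels).most_common(1)[0][1]
    e = n - majority_count
    return e, n, e / n


# Hypothetical node: 16 records, 15 labelled "yes" and 1 labelled "no".
errors, records, rate = node_error_rate(["yes"] * 15 + ["no"])
print(errors, records, rate)
```

This is the optimistic, training-set figure; the pruning step treats it as a lower limit rather than as the expected rate on new data.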

C5 uses an analogy with statistical sampling to come up with an estimate of the worst error rate likely to be seen at a leaf. The analogy works by thinking of the data at the leaf as representing the results of a series of trials, each of which can have one of two possible results.
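To make the sampling analogy concrete, the sketch below treats each record at a leaf as one trial whose two outcomes are "classified correctly" and "classified incorrectly", and uses the binomial distribution to ask how likely a given number of errors would be under an assumed true error rate. The function names and the example numbers are illustrative assumptions, not taken from C5.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k errors in n independent trials
    when each trial is an error with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_most(e, n, p):
    """Probability of observing e or fewer errors in n trials."""
    return sum(binomial_pmf(k, n, p) for k in range(e + 1))


# Hypothetical leaf: 10 records. If the true error rate were 20%,
# how often would the training data show no errors at all?
print(prob_at_most(0, 10, 0.2))
```

A leaf that shows few errors on its training records is therefore still compatible with a noticeably higher true error rate, which is the gap the pessimistic estimate tries to account for.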

For a given confidence level, the statistics of such trials give a confidence interval, that is, a range of plausible values for the number of errors. C5 assumes that the observed number of errors on the training data is the low end of this range, and substitutes the high end to get a leaf’s predicted error rate, E/N, on unseen data. The smaller the node, the higher the estimated error rate. When the high-end estimate of the number of errors at a node is less than the estimate for the errors of its children, the children are pruned.
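The pruning decision can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not C5's actual implementation: it computes an exact binomial upper confidence limit by bisection, uses an assumed confidence level of 0.25 (the `cf` parameter), and represents each node as a hypothetical `(errors, records)` pair. The real software may use different approximations for the limit.

```python
from math import comb


def binomial_cdf(e, n, p):
    """P(at most e errors in n trials) when the true error rate is p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(e + 1))


def pessimistic_error_rate(e, n, cf=0.25, tol=1e-9):
    """High end of the confidence range for the error rate at a node.

    Finds the rate p at which seeing e or fewer errors in n records
    would happen with probability cf. cf=0.25 is an assumed default.
    """
    if e >= n:
        return 1.0
    lo, hi = e / n, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binomial_cdf(e, n, mid) > cf:
            lo = mid  # e errors are still too likely: the limit is higher
        else:
            hi = mid
    return (lo + hi) / 2


def should_prune(node, children, cf=0.25):
    """Prune when the node's high-end error count is less than the sum
    of its children's. node and each child are (errors, records) pairs."""
    node_estimate = node[1] * pessimistic_error_rate(node[0], node[1], cf)
    child_estimate = sum(
        n * pessimistic_error_rate(e, n, cf) for e, n in children
    )
    return node_estimate < child_estimate


# Hypothetical subtree: the parent misclassifies 1 of 16 records, while
# its three leaves (6, 9, and 1 records) make no training errors.
print(should_prune((1, 16), [(0, 6), (0, 9), (0, 1)]))  # True
```

In this example the children look perfect on the training data, yet they are pruned: each small leaf receives a large pessimistic estimate, so their combined predicted errors exceed those of the parent treated as a single leaf.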

The main goal of a model is to make consistent predictions on previously unseen data. Any rule that cannot achieve that goal should be removed from the model. Some data mining tools allow the user to prune a decision tree manually.

This is a useful facility, but one can look forward to data mining software that supports automatic stability-based pruning as an option. Such software would need a less subjective criterion for rejecting a split than “the distribution of the validation set results looks different from the distribution of the training set results.”

Updated on 15-Feb-2022 06:31:12