What are the estimation methods in data mining?

Tenfold cross-validation is the standard way of measuring the error rate of a learning scheme on a particular dataset; for reliable results, tenfold cross-validation is repeated 10 times and the results are averaged. Two other estimation methods are leave-one-out cross-validation and the bootstrap.
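As a rough illustration of repeating tenfold cross-validation 10 times, here is a minimal sketch using only the Python standard library. The names majority_learner, kfold_error and repeated_kfold_error are hypothetical, the baseline learner is a stand-in for a real learning scheme, and the folds are not stratified.

```python
import random
from collections import Counter

def majority_learner(train):
    """Hypothetical baseline learner: always predicts the most common training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda x: majority

def kfold_error(data, learner, k, rng):
    """Error rate from one run of (non-stratified) k-fold cross-validation."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k nearly equal parts
    errors = 0
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [inst for j, fold in enumerate(folds) if j != i for inst in fold]
        predict = learner(train)
        errors += sum(1 for x, y in test if predict(x) != y)
    return errors / len(data)

def repeated_kfold_error(data, learner, k=10, repeats=10, seed=0):
    """Average the k-fold error over several runs with different random splits."""
    rng = random.Random(seed)
    runs = [kfold_error(data, learner, k, rng) for _ in range(repeats)]
    return sum(runs) / repeats

# Toy dataset (an assumption for illustration): 70 instances of class "a", 30 of class "b".
toy_data = [(i, "a") for i in range(70)] + [(i, "b") for i in range(30)]
print(repeated_kfold_error(toy_data, majority_learner))
```

Each repetition reshuffles the data, so the ten runs use different partitions into folds; averaging them reduces the variation caused by any single random split.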

Leave-One-Out Cross-Validation

Leave-one-out cross-validation is simply n-fold cross-validation, where n is the number of instances in the dataset. Each instance in turn is left out, and the learning scheme is trained on all the remaining instances. It is then judged by its correctness on the left-out instance: one for success, zero for failure. The results of all n judgments, one for each instance in the dataset, are averaged, and the proportion of failures is the final error estimate.
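The procedure just described can be sketched in a few lines of Python. This is only an illustration: loocv_error and nearest_neighbour_learner are hypothetical names, and the one-nearest-neighbour learner on a single numeric attribute is an assumption chosen to keep the example short.

```python
def nearest_neighbour_learner(train):
    """Hypothetical learner: predicts the label of the closest training value."""
    def predict(x):
        _, label = min(train, key=lambda inst: abs(inst[0] - x))
        return label
    return predict

def loocv_error(data, learner):
    """Leave-one-out error: train on n - 1 instances, test on the one left out."""
    n = len(data)
    errors = 0
    for i in range(n):
        train = data[:i] + data[i + 1:]   # every instance except the i-th
        predict = learner(train)
        x, y = data[i]
        if predict(x) != y:               # a failure on the left-out instance
            errors += 1
    return errors / n

# Toy dataset (an assumption for illustration): one numeric attribute, two classes.
toy_data = [(1.0, "a"), (1.2, "a"), (0.9, "a"), (3.1, "a"),
            (4.8, "b"), (5.0, "b"), (5.3, "b")]
print(loocv_error(toy_data, nearest_neighbour_learner))
```

On this toy data only the instance at 3.1 is misclassified when it is left out, so the estimate is 1/7. Because no random numbers are used, calling loocv_error again returns exactly the same value, while the loop trains the learner n times.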

This procedure is attractive for two reasons. First, the greatest possible amount of data is used for training in each case, which presumably increases the chance that the classifier is an accurate one.

Second, the procedure is deterministic: no random sampling is involved. There is no point in repeating it 10 times, or repeating it at all, because the same result will be obtained each time. Set against this is the high computational cost: the whole learning procedure must be executed n times, which is usually infeasible for large datasets.

The Bootstrap

The second estimation method we describe, the bootstrap, is based on the statistical procedure of sampling with replacement. Previously, whenever a sample was taken from the dataset to form a training or test set, it was drawn without replacement.

Most learning schemes can use the same instance twice, and it makes a difference to the result of learning if an instance is present in the training set twice. The idea of the bootstrap is to sample the dataset with replacement to form a training set. We will describe a particular variant, mysteriously (but for a reason that will soon become apparent) called the 0.632 bootstrap.

For this, a dataset of n instances is sampled n times, with replacement, to give another dataset of n instances. Because some elements in this second dataset will (almost certainly) be repeated, there must be some instances in the original dataset that have not been picked; we will use these as test instances.
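The sampling step can be sketched as follows, using only the Python standard library. The function name bootstrap_split is hypothetical, and the sketch covers only the split into training and test instances, not the training or the error calculation.

```python
import random

def bootstrap_split(data, rng):
    """Sample n instances with replacement for training; unpicked ones form the test set."""
    n = len(data)
    picked = [rng.randrange(n) for _ in range(n)]   # n draws, repeats allowed
    picked_set = set(picked)
    train = [data[i] for i in picked]               # size n, contains duplicates
    test = [data[i] for i in range(n) if i not in picked_set]
    return train, test

# Toy dataset (an assumption for illustration): the integers 0 to 999.
rng = random.Random(0)
train, test = bootstrap_split(list(range(1000)), rng)
print(len(train), len(set(train)), len(test))
```

The training set always has n elements, but the number of distinct instances in it is smaller, and every instance that was never drawn ends up in the test set.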

The figure obtained by training a learning system on the training set and calculating its error over the test set will be a pessimistic estimate of the true error rate. The reason is that the training set, although its size is n, contains only about 63% of the distinct instances, which is not a great deal compared, for example, with the 90% used in tenfold cross-validation.
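The 63% figure, and with it the name of the 0.632 bootstrap, follows from a short calculation. As a sketch, assuming that each of the n draws picks every instance with equal probability 1/n:

```latex
\begin{align*}
P(\text{a given instance is not picked in one draw}) &= 1 - \frac{1}{n} \\
P(\text{it is never picked in } n \text{ draws}) &= \left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368 \\
\text{expected fraction of distinct instances in the training set} &\approx 1 - 0.368 = 0.632
\end{align*}
```

So, for a reasonably large dataset, roughly 36.8% of the instances end up in the test set and roughly 63.2% appear at least once in the training set.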