How to Reduce the Number of Predictors?


A frequent problem in data mining is using a regression equation to forecast the value of a dependent variable when many candidate variables are available to choose from as predictors in the model.

Another consideration favoring the inclusion of numerous variables is the hope that a previously hidden relationship will emerge. For example, a company found that customers who had purchased anti-scuff protectors for chair and table legs had lower credit risk.

There are several reasons for exercising caution before throwing all possible variables into a model.

  • It may be expensive or infeasible to collect a full complement of predictors for future predictions.

  • It may be possible to measure fewer predictors more accurately (e.g., in surveys).

  • The more predictors there are, the higher the chance of missing values in the data. If we delete or impute records with missing values, more predictors lead to a higher rate of record deletion or imputation (illustrated in the first sketch after this list).

  • Parsimony is an essential feature of good models. We obtain more insight into the influence of predictors in models with few parameters.

  • Estimates of regression coefficients are likely to be unstable because of multicollinearity in models with many variables. (Multicollinearity is the presence of strong linear relationships among two or more predictors, which makes their individual effects hard to separate; see the second sketch after this list.)

  • Regression coefficients are more stable for parsimonious models. One very rough rule of thumb is to have the number of records n larger than 5(p + 2), where p is the number of predictors (see the third sketch after this list).

  • It can be shown that using predictors that are uncorrelated with the outcome variable increases the variance of predictions.

  • It can be shown that dropping predictors that are correlated with the outcome variable can increase the average error (bias) of predictions.

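To make the missing-data point concrete, here is a minimal Python sketch. It assumes, purely for illustration, that each predictor value is missing independently with a fixed probability; under that assumption the share of fully observed records shrinks quickly as predictors are added.

```python
# Illustrative assumption: each predictor is missing independently
# with probability miss_rate in every record.
miss_rate = 0.05

for p in (5, 10, 30, 100):
    complete = (1 - miss_rate) ** p   # share of records with no missing values
    print(f"p = {p:>3} predictors -> {complete:.1%} of records are complete")
```
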
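One common way to detect multicollinearity is to compute variance inflation factors (VIFs). The sketch below is illustrative only: it assumes pandas and statsmodels are available, and the predictor matrix is synthetic, with x3 constructed to be nearly a linear combination of x1 and x2.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + rng.normal(scale=0.05, size=200)   # nearly a linear combination of x1 and x2
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF for each predictor (skipping the intercept); values far above ~10
# indicate coefficient estimates inflated by collinearity.
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(X.values, i), 1))
```
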
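The n > 5(p + 2) rule of thumb is easy to turn into a quick check. The helper below, enough_records, is a hypothetical convenience function written for this article, not part of any library.

```python
def enough_records(n_records: int, n_predictors: int) -> bool:
    """Very rough rule of thumb: require n > 5 * (p + 2) records."""
    return n_records > 5 * (n_predictors + 2)

print(enough_records(200, 10))   # True: 200 > 5 * (10 + 2) = 60
print(enough_records(50, 15))    # False: 50 <= 5 * (15 + 2) = 85
```
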
The last two points describe a trade-off between too few and too many predictors. In general, accepting some bias can reduce the variance of predictions. This bias-variance trade-off is particularly important when there are many predictors, because it is then likely that the model contains variables whose coefficients are small relative to the standard deviation of the noise and that also exhibit at least moderate correlation with other variables.

Dropping such variables will typically improve predictions, because the reduction in prediction variance outweighs the small increase in bias. This kind of bias-variance trade-off is a basic element of data mining procedures for prediction and classification.

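As a rough illustration of this trade-off, the simulation sketch below uses synthetic data and an assumed split into "strong" and "weak" predictors (not a prescribed procedure). It compares the out-of-sample error of a full model against a reduced model that drops the predictors whose true coefficients are small relative to the noise.

```python
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(1)
n, p_strong, p_weak, noise_sd = 60, 3, 20, 1.0

# True coefficients: a few strong predictors plus many weak ones whose
# effects are small relative to the noise standard deviation.
beta = np.concatenate([np.full(p_strong, 2.0), np.full(p_weak, 0.05)])

def simulate_errors(n_reps=500):
    full_err, reduced_err = [], []
    for _ in range(n_reps):
        X_train = rng.normal(size=(n, p_strong + p_weak))
        X_test = rng.normal(size=(200, p_strong + p_weak))
        y_train = X_train @ beta + rng.normal(scale=noise_sd, size=n)
        y_test = X_test @ beta + rng.normal(scale=noise_sd, size=200)

        b_full, *_ = lstsq(X_train, y_train, rcond=None)                  # all predictors
        b_red, *_ = lstsq(X_train[:, :p_strong], y_train, rcond=None)     # strong predictors only

        full_err.append(np.mean((y_test - X_test @ b_full) ** 2))
        reduced_err.append(np.mean((y_test - X_test[:, :p_strong] @ b_red) ** 2))
    return np.mean(full_err), np.mean(reduced_err)

print(simulate_errors())   # the reduced model typically has lower test MSE
```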