What are the methodologies of statistical data mining?


Statistical data mining techniques are designed for the efficient handling of large amounts of data that are typically multidimensional and possibly of several complex types.

There are several well-established statistical methods for data analysis, especially for numeric data. These methods have been applied extensively to scientific records (e.g., records from experiments in physics, engineering, manufacturing, psychology, and medicine), and to data from economics and the social sciences.

The main methodologies of statistical data mining are as follows −

Regression − In general, these techniques are used to predict the value of a response (dependent) variable from one or more predictor (independent) variables, where the variables are numeric. There are several forms of regression, including linear, multiple, weighted, polynomial, nonparametric, and robust (robust methods are useful when the errors fail to satisfy normality conditions or when the data contain significant outliers).
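As a minimal sketch of the simplest case (one numeric predictor, ordinary least squares), the following Python code fits a line and forecasts a new value. The function names `fit_simple_linear` and `predict` and the data are made up for illustration; they are not taken from any particular library.

```python
from statistics import mean

def fit_simple_linear(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross-deviations of x and y
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def predict(model, x):
    """Forecast the response for a new predictor value."""
    intercept, slope = model
    return intercept + slope * x

# Illustrative (invented) data lying exactly on y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = fit_simple_linear(xs, ys)   # intercept 1.0, slope 2.0
forecast = predict(model, 5.0)      # 11.0
print(model, forecast)
```

The other forms of regression listed above (multiple, weighted, polynomial, robust) change how the fit is computed, but the fit-then-predict pattern stays the same.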

Generalized linear models − These models, and their generalization (generalized additive models), allow a categorical (nominal) response variable (or some transformation of it) to be related to a set of predictor variables in a manner similar to the modeling of a numeric response variable using linear regression. Generalized linear models include logistic regression and Poisson regression.
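To make the logistic regression case concrete, here is a small sketch that fits a binary response on one predictor. It assumes plain gradient ascent on the log-likelihood, which is chosen here only for brevity; statistical packages normally use other fitting procedures. The names, learning rate, epoch count, and data are all illustrative assumptions.

```python
import math

def sigmoid(z):
    """Inverse of the logit link: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by gradient ascent
    on the average log-likelihood (an assumed, simple optimizer)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # observed minus predicted
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Invented data: larger x makes the outcome 1 more likely
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
p_low = sigmoid(b0 + b1 * 1.0)   # predicted probability at x = 1
p_high = sigmoid(b0 + b1 * 4.0)  # predicted probability at x = 4
print(b0, b1, p_low, p_high)
```

The response is categorical (0 or 1), yet the model is still linear in the predictor on the logit scale, which is the sense in which it resembles linear regression.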

Analysis of variance − These techniques analyze experimental data for two or more populations described by a numeric response variable and one or more categorical variables (factors). In general, an ANOVA (single-factor analysis of variance) problem involves a comparison of k population or treatment means to determine whether at least two of the means are different.
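The single-factor comparison can be sketched by computing the F statistic, the ratio of between-group to within-group mean squares. The function name `one_way_anova_f` and the three treatment groups below are invented for illustration.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the F statistic for a single-factor ANOVA over k groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([v for g in groups for v in g])
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Invented responses for k = 3 treatments
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
f_stat = one_way_anova_f(groups)  # approximately 19
print(f_stat)
```

To decide whether at least two means differ, the statistic is compared against an F distribution with (k − 1, N − k) degrees of freedom. That lookup is left out here because it needs a distribution table or a statistics library.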

Mixed-effect models − These models are for analyzing grouped data, that is, data that can be classified according to one or more grouping variables. They typically describe relationships between a response variable and some covariates in data grouped according to one or more factors. Common areas of application include multilevel data, repeated measures data, block designs, and longitudinal data.

Factor analysis − This method is used to determine which variables are combined to generate a given factor. For example, for many psychiatric data, it is not possible to measure a certain factor of interest directly (e.g., intelligence); however, it is often possible to measure other quantities that reflect the factor of interest. Here, none of the variables is designated as dependent.

Discriminant analysis − This technique is used to predict a categorical response variable. Unlike generalized linear models, it assumes that the independent variables follow a multivariate normal distribution. The procedure attempts to determine several discriminant functions (linear combinations of the independent variables) that discriminate among the groups defined by the response variable. Discriminant analysis is commonly used in the social sciences.

Survival analysis − Several well-established statistical techniques exist for survival analysis. These techniques were originally designed to predict the probability that a patient undergoing a medical treatment would survive at least to time t.
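One common way to estimate the probability of surviving at least to time t is the Kaplan-Meier estimator. The text above does not name a specific method, so treat this as one illustrative choice. The function name `kaplan_meier`, the follow-up times, and the event flags are assumptions made for the sketch; a flag of 0 marks a censored patient, meaning follow-up ended before any event was observed.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time where an event occurred.

    durations: follow-up time per patient
    observed:  1 if the event (death) was seen, 0 if censored
    """
    pairs = sorted(zip(durations, observed))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths:
            # Multiply by the fraction of at-risk patients surviving past t
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

# Invented follow-up data for five patients
durations = [2, 3, 3, 5, 8]
observed = [1, 1, 0, 1, 0]

curve = kaplan_meier(durations, observed)
# Survival estimates are roughly 0.8 at t=2, 0.6 at t=3, 0.3 at t=5
print(curve)
```

Censored patients still count as at risk until their follow-up ends, which is why they lower the denominator without triggering a drop in the curve.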

Quality control − Various statistics can be used to prepare charts for quality control, such as Shewhart charts and CUSUM charts. These statistics include the mean, standard deviation, range, count, moving average, moving standard deviation, and moving range.
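As a sketch of how the mean and the moving range feed a Shewhart-style chart, the code below computes three-sigma control limits for an individuals chart and flags new measurements that fall outside them. The function names and the measurements are invented for illustration. The constant 1.128 is the standard d2 factor for converting an average moving range of two consecutive points into a sigma estimate.

```python
from statistics import mean

def individuals_chart_limits(values):
    """Return (lower limit, center line, upper limit) for an individuals chart."""
    # Moving range: absolute difference between consecutive measurements
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = mean(values)
    sigma_hat = mean(moving_ranges) / 1.128  # d2 constant for ranges of size 2
    return center - 3 * sigma_hat, center, center + 3 * sigma_hat

def out_of_control(values, limits):
    """Return the indexes of measurements outside the control limits."""
    lcl, _, ucl = limits
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]

# Invented baseline measurements from a stable process
baseline = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.3, 9.9]
limits = individuals_chart_limits(baseline)

# Invented new measurements; the last one is far from the baseline
new_points = [10.1, 9.9, 11.5]
flagged = out_of_control(new_points, limits)
print(limits, flagged)
```

A CUSUM chart would instead accumulate deviations from a target over time, which makes it more sensitive to small sustained shifts than a chart of individual points.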

Updated on 18-Feb-2022 10:40:01