What are data generalization and analytical characterization?

Data generalization summarizes data by replacing relatively low-level values (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, and senior). In other words, it is a process that abstracts a large set of task-relevant data in a database from a relatively low conceptual level to higher conceptual levels.
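
The age example above can be sketched in a few lines of Python. The cut-off ages (40 and 60) and the function name generalize_age are assumptions made for illustration; the concept names come from the text, but the boundaries between them do not.

```python
# Hypothetical concept hierarchy for the attribute "age".
# The boundaries 40 and 60 are illustrative assumptions only.
def generalize_age(age: int) -> str:
    """Map a low-level numeric age to a higher-level concept."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 40:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"


ages = [23, 45, 67, 31]
generalized = [generalize_age(a) for a in ages]
print(generalized)  # ['young', 'middle-aged', 'senior', 'young']
```

Four distinct numeric values collapse into three concepts here; on a real table with thousands of distinct ages the reduction would be far larger.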

Following are the two approaches for the efficient and flexible generalization of large data sets −

OLAP approach − The data cube technology can be treated as a data warehouse-based, pre-computation-oriented, materialized-view approach. It performs offline aggregation before an OLAP or data mining query is submitted for processing.
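
As a rough illustration of pre-computation, the sketch below aggregates a tiny fact table over every combination of dimensions ahead of time, so that a later query becomes a dictionary lookup. The sample rows, the dimension names, and the build_cube helper are all hypothetical; real OLAP engines use far more elaborate storage and usually materialize only some of these aggregates.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: each row has three dimensions and one measure.
DIMENSIONS = ("city", "year", "item")
facts = [
    {"city": "Pune", "year": 2023, "item": "pen", "units": 10},
    {"city": "Pune", "year": 2024, "item": "pen", "units": 5},
    {"city": "Delhi", "year": 2023, "item": "ink", "units": 7},
]


def build_cube(rows, dimensions, measure):
    """Pre-compute the sum of `measure` for every subset of dimensions."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


# Offline step: done once, before any query arrives.
cube = build_cube(facts, DIMENSIONS, "units")

# Query time: a lookup instead of a scan over the fact table.
print(cube[("city",)][("Pune",)])  # 15
print(cube[()][()])                # 22 (grand total)
```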

Attribute-oriented induction approach − It is a relational database query-oriented, generalization-based, online data analysis approach. In attribute-oriented induction, the task-relevant data is first collected using a relational database query, and generalization is then performed based on the number of distinct values of each attribute in the relevant set of data.

Generalization is performed by either attribute removal or attribute generalization. Aggregation is then performed by merging identical generalized tuples and accumulating their respective counts. This reduces the size of the generalized data set and makes it easier to present the result interactively to users.
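
A minimal sketch of the merge-and-count step, assuming a small hypothetical student relation, an invented concept hierarchy for major, and illustrative age boundaries. None of these values come from the text; they only show how identical generalized tuples collapse into one tuple with a count.

```python
from collections import Counter

# Hypothetical task-relevant tuples: (name, age, major).
students = [
    ("Asha", 21, "physics"),
    ("Ben", 23, "history"),
    ("Chen", 22, "chemistry"),
    ("Dana", 45, "history"),
]

# Assumed concept hierarchy for the attribute "major".
MAJOR_HIERARCHY = {"physics": "science", "chemistry": "science", "history": "arts"}


def generalize(row):
    """Drop `name` (attribute removal) and generalize `age` and `major`."""
    _name, age, major = row
    if age < 40:
        age_group = "young"
    elif age < 60:
        age_group = "middle-aged"
    else:
        age_group = "senior"
    return (age_group, MAJOR_HIERARCHY[major])


# Identical generalized tuples are merged and their counts accumulated.
generalized_relation = Counter(generalize(row) for row in students)

for tuple_, count in generalized_relation.items():
    print(tuple_, count)
```

Four original tuples become three generalized tuples, with the count column recording how many original rows each one stands for.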

Basic principles of attribute-oriented induction approach −

  • Data focusing − The analysis should be restricted to task-relevant data, including the dimensions of interest. The result of this step is the initial relation on which induction is performed.
  • Attribute removal − Remove an attribute A if there is a large set of distinct values for A but either there is no generalization operator on A, or A's higher-level concepts are expressed in terms of other attributes.
  • Attribute generalization − If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
  • Analytical characterization − It is a statistical approach for preprocessing data to filter out irrelevant attributes or rank the relevant ones. Measures of attribute relevance can be used to identify irrelevant attributes, which can then be excluded from the concept description process. Incorporating this preprocessing step into class characterization or comparison is known as analytical characterization.
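
The attribute-removal and attribute-generalization rules above can be sketched as a simple decision function. The threshold parameter is an assumption standing in for "a large set of distinct values", and plan_attribute and the sample columns are hypothetical. The case where A's higher-level concepts are expressed through other attributes is not modeled here.

```python
def plan_attribute(values, has_generalization_operator, threshold):
    """Decide what to do with one attribute of the initial relation.

    `threshold` is an assumed cut-off for "a large set of distinct values".
    """
    distinct = len(set(values))
    if distinct <= threshold:
        return "keep"
    if has_generalization_operator:
        return "generalize"
    return "remove"


# Hypothetical columns of an initial relation.
columns = {
    "name": ["Asha", "Ben", "Chen", "Dana"],
    "major": ["physics", "history", "chemistry", "history"],
    "gender": ["F", "M", "M", "F"],
}
# Assumption: only "major" has a concept hierarchy available.
has_operator = {"name": False, "major": True, "gender": False}

plan = {
    attr: plan_attribute(vals, has_operator[attr], threshold=2)
    for attr, vals in columns.items()
}
print(plan)  # {'name': 'remove', 'major': 'generalize', 'gender': 'keep'}
```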

Reasons for attribute relevance analysis

The reasons for performing attribute relevance analysis are as follows −

  • It can determine which dimensions should be included.

  • It can help determine how high a level of generalization should be used.

  • It can reduce the number of attributes, which makes the resulting patterns easier to understand.

The basic idea behind attribute relevance analysis is to compute some measure that quantifies the relevance of an attribute with respect to a given class or concept. Such measures include information gain, uncertainty, and the correlation coefficient.
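
As one concrete instance, the sketch below computes information gain, the first measure named above, for two attributes of a tiny hypothetical data set. The rows, the attribute names, and the target class status are invented for illustration; the formula is the standard entropy-based definition.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of `target` minus its expected entropy after splitting on `attribute`."""
    labels = [row[target] for row in rows]
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(
        len(part) / len(rows) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


# Hypothetical data: "major" separates the classes, "gender" does not.
rows = [
    {"major": "science", "gender": "F", "status": "graduate"},
    {"major": "science", "gender": "M", "status": "graduate"},
    {"major": "arts", "gender": "F", "status": "undergraduate"},
    {"major": "arts", "gender": "M", "status": "undergraduate"},
]

gains = {a: information_gain(rows, a, "status") for a in ("major", "gender")}
ranked = sorted(gains, key=gains.get, reverse=True)
print(ranked)  # ['major', 'gender']
```

An attribute with a gain near zero, such as gender in this toy data, is a candidate for exclusion before characterization, while higher-gain attributes are ranked first.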