What are the methods for Data Generalization and Concept Description?

Data generalization summarizes data by replacing relatively low-level values (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, and senior). Given the large amount of data stored in databases, it is useful to be able to describe concepts in concise and succinct terms at generalized (rather than low) levels of abstraction.
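As a minimal sketch of this idea, the Python snippet below maps numeric ages to the higher-level concepts named above. The function name `generalize_age` and the cutoffs (40 and 60) are assumptions chosen only for illustration; a real concept hierarchy would come from domain experts or from the data itself.

```python
def generalize_age(age: int) -> str:
    """Map a numeric age to a higher-level concept.

    The cutoffs below are illustrative assumptions, not a standard.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 40:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"


# Low-level values for the attribute age ...
ages = [23, 41, 67, 35, 59]

# ... replaced by higher-level concepts.
generalized = [generalize_age(a) for a in ages]
print(generalized)
```

After generalization, many distinct numeric values collapse into three labels, which is what makes the data easier to summarize.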

Allowing data sets to be generalized at multiple levels of abstraction helps users examine the general behavior of the data. Given the AllElectronics database, for instance, rather than examining individual customer transactions, sales managers may prefer to view the data generalized to higher levels, such as summarized by customer groups according to geographic region, frequency of purchases per group, and customer income. This leads us to the notion of concept description, which is a form of data generalization.
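To make the sales example concrete, here is a small sketch that summarizes individual transactions by geographic region. The records, field layout, and the helper `summarize_by_region` are hypothetical; they do not reflect the actual AllElectronics schema.

```python
from collections import defaultdict

# Hypothetical transaction records: (customer_id, region, amount)
transactions = [
    ("c1", "North", 120.0),
    ("c2", "North", 80.0),
    ("c1", "North", 40.0),
    ("c3", "South", 200.0),
    ("c4", "South", 50.0),
]


def summarize_by_region(rows):
    """Generalize single transactions to per-region purchase counts and totals."""
    totals = defaultdict(lambda: {"purchases": 0, "total": 0.0})
    for _customer, region, amount in rows:
        totals[region]["purchases"] += 1
        totals[region]["total"] += amount
    return dict(totals)


summary = summarize_by_region(transactions)
for region, stats in sorted(summary.items()):
    print(region, stats["purchases"], stats["total"])
```

The manager now sees one row per region instead of one row per transaction; the same pattern could group by income band or purchase frequency instead.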

A concept generally refers to a set of data, such as frequent buyers or graduate students. As a data mining task, concept description is not a simple enumeration of the data. Instead, concept description generates descriptions for the characterization and comparison of the data. It is also known as class description when the concept to be described refers to a class of objects.
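The difference between enumerating data and describing it can be sketched as follows. The snippet characterizes one class (graduate students) by the relative frequency of a generalized attribute, then compares it with a contrasting class. The records and the functions `characterize` and `compare` are assumptions made for illustration, not a prescribed algorithm.

```python
from collections import Counter

# Hypothetical student records: (status, major, gpa_band)
students = [
    ("graduate", "CS", "excellent"),
    ("graduate", "CS", "good"),
    ("graduate", "Physics", "excellent"),
    ("undergraduate", "CS", "good"),
    ("undergraduate", "Biology", "fair"),
    ("undergraduate", "Biology", "good"),
]


def characterize(rows, status):
    """Summarize one class as relative frequencies of its GPA bands."""
    bands = Counter(band for s, _major, band in rows if s == status)
    total = sum(bands.values())
    if total == 0:
        return {}
    return {band: count / total for band, count in bands.items()}


def compare(rows, target, contrast):
    """Pair the target and contrasting class frequencies for each GPA band."""
    t = characterize(rows, target)
    c = characterize(rows, contrast)
    return {band: (t.get(band, 0.0), c.get(band, 0.0))
            for band in sorted(set(t) | set(c))}


grad_profile = characterize(students, "graduate")
comparison = compare(students, "graduate", "undergraduate")
print(grad_profile)
print(comparison)
```

Here `characterize` produces a summary of a single class, while `compare` places two classes side by side so that their differences stand out.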

Characterization provides a concise and succinct summarization of the given set of data, while concept or class comparison (also referred to as discrimination) provides descriptions comparing two or more sets of data. Concept description differs from OLAP in data warehouses in the following respects −

Complex data types and aggregation − Data warehouses and OLAP tools are based on a multidimensional data model that views data in the form of a data cube, consisting of dimensions (or attributes) and measures (aggregate functions).

However, many current OLAP systems confine dimensions to non-numeric data and measures to numeric data. In reality, a database can include attributes of various data types, such as numeric, non-numeric, spatial, text, or image, which should be handled in concept description.

User-control versus automation − On-line analytical processing in data warehouses is a user-controlled process. The selection of dimensions and the application of OLAP operations, such as drill-down, roll-up, slicing, and dicing, are generally directed and controlled by the users.

Although the control in many OLAP systems is user-friendly, users do need a good understanding of the role of each dimension. Moreover, to find a satisfactory description of the data, users may need to specify a long series of OLAP operations.

It is desirable to have a more automated process that helps users decide which dimensions (or attributes) should be included in the analysis, and the degree to which the given data set should be generalized in order to produce an interesting summarization of the data.
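One simple way such automation might work is a threshold heuristic, loosely in the spirit of attribute-oriented induction: an attribute with few distinct values is kept, one with many distinct values is generalized if a concept hierarchy is available, and otherwise removed. The text above does not prescribe this method; the function `plan_generalization`, the threshold value, and the sample data are all assumptions for illustration.

```python
def plan_generalization(rows, hierarchies, threshold=3):
    """Return a per-attribute decision: 'keep', 'generalize', or 'remove'.

    rows        -- list of dicts sharing the same keys (hypothetical records)
    hierarchies -- attributes that have a concept hierarchy available
    threshold   -- maximum number of distinct values tolerated (assumed value)
    """
    plan = {}
    attributes = rows[0].keys() if rows else []
    for attr in attributes:
        distinct = {row[attr] for row in rows}
        if len(distinct) <= threshold:
            plan[attr] = "keep"        # already compact enough
        elif attr in hierarchies:
            plan[attr] = "generalize"  # climb the concept hierarchy
        else:
            plan[attr] = "remove"      # too many values, nothing to climb
    return plan


rows = [
    {"name": "Ann", "city": "Vancouver", "gender": "F"},
    {"name": "Bob", "city": "Toronto", "gender": "M"},
    {"name": "Cy", "city": "Seattle", "gender": "M"},
    {"name": "Di", "city": "Chicago", "gender": "F"},
]
hierarchies = {
    "city": {"Vancouver": "Canada", "Toronto": "Canada",
             "Seattle": "USA", "Chicago": "USA"},
}

plan = plan_generalization(rows, hierarchies, threshold=3)
print(plan)
```

In this sketch the user supplies only a threshold; the decision about which attributes to include and how far to generalize them is made from the data rather than through a long sequence of manual OLAP operations.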