What is Data Cube Aggregation?



Data integration is the process of merging data from several disparate sources. During integration, issues such as data redundancy, inconsistency, and duplication must be handled. In data mining, data integration is a data preprocessing method that merges data from multiple heterogeneous sources into a coherent data store, providing a unified view of the data.

Data integration is especially important in the healthcare industry. Integrating data from several patient records and clinics helps clinicians identify medical disorders and diseases, because information from several systems is combined into a single view from which useful insights can be derived.

Effective data collection and integration also improve the accuracy of medical insurance claims processing and ensure that patient names and contact information are recorded consistently and accurately. This sharing of information across different systems is referred to as interoperability.

When data is available in a form different from the one needed, aggregation can be applied to its attributes to obtain the desired form. For example, a shop has data consisting of its quarterly sales for the years 2010 to 2012. The data is available per quarter, but the annual sales are needed. The quarterly figures therefore have to be aggregated to produce the desired output.

Quarterly sales:

Quarter | Year 2010 | Year 2011 | Year 2012
Q1      | Rs.10000  | Rs.8000   | Rs.15000
Q2      | Rs.50000  | Rs.15000  | Rs.20000
Q3      | Rs.40000  | Rs.10000  | Rs.40000
Q4      | Rs.30000  | Rs.20000  | Rs.30000

Annual sales:

Year | Sales
2010 | Rs.1,30,000
2011 | Rs.53000
2012 | Rs.1,05,000

The quarterly sales for 2010 to 2012 are aggregated into one annual sales record per year.
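The aggregation in this example can be sketched in a few lines of Python. This is only a minimal illustration using the figures from the table; the dictionary layout and the function name aggregate_annual are assumptions made for this sketch, not part of any particular tool.

```python
from collections import defaultdict

# Quarterly sales in rupees, keyed by (year, quarter), taken from the table.
quarterly_sales = {
    (2010, "Q1"): 10000, (2010, "Q2"): 50000,
    (2010, "Q3"): 40000, (2010, "Q4"): 30000,
    (2011, "Q1"): 8000,  (2011, "Q2"): 15000,
    (2011, "Q3"): 10000, (2011, "Q4"): 20000,
    (2012, "Q1"): 15000, (2012, "Q2"): 20000,
    (2012, "Q3"): 40000, (2012, "Q4"): 30000,
}


def aggregate_annual(sales):
    """Sum the quarterly amounts of each year (hypothetical helper)."""
    totals = defaultdict(int)
    for (year, _quarter), amount in sales.items():
        totals[year] += amount
    return dict(totals)


annual_sales = aggregate_annual(quarterly_sales)
print(annual_sales)  # {2010: 130000, 2011: 53000, 2012: 105000}
```

The twelve quarterly records shrink to three annual records, which is the data reduction that aggregation provides.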

Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple levels of abstraction. For example, a hierarchy for a branch could allow branches to be grouped into regions, based on their address. Data cubes support quick access to pre-computed, summarized data, thus benefiting online analytical processing and data mining.
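A concept hierarchy can be pictured as a mapping from a lower-level value to its higher-level group. The sketch below rolls branch-level sales up to regions; the branch names, region names, sales figures, and the roll_up helper are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical concept hierarchy: branch -> region (names are invented).
branch_to_region = {
    "Branch A": "North",
    "Branch B": "North",
    "Branch C": "South",
}

# Hypothetical branch-level sales figures.
branch_sales = {"Branch A": 100, "Branch B": 250, "Branch C": 400}


def roll_up(values, hierarchy):
    """Aggregate values to the next level of the hierarchy."""
    totals = defaultdict(int)
    for member, value in values.items():
        totals[hierarchy[member]] += value
    return dict(totals)


region_sales = roll_up(branch_sales, branch_to_region)
print(region_sales)  # {'North': 350, 'South': 400}
```

The same data can now be analyzed at the branch level or at the region level, depending on how much detail the task needs.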

The cube created at the lowest level of abstraction is called the base cuboid. The base cuboid should correspond to a single entity of interest, such as sales or customers. In other words, the lowest level must be usable, or helpful, for the analysis. The cube at the highest level of abstraction is the apex cuboid.
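To make the two extremes concrete, the following sketch builds a base cuboid and an apex cuboid from a handful of fact rows. The rows and the choice of dimensions (year, quarter, branch) are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, branch, sales). Values are invented.
rows = [
    (2010, "Q1", "Branch A", 100),
    (2010, "Q1", "Branch A", 50),
    (2010, "Q2", "Branch B", 200),
    (2011, "Q1", "Branch A", 300),
]

# Base cuboid: grouped by every dimension, the lowest level of abstraction.
base_cuboid = defaultdict(int)
for year, quarter, branch, sales in rows:
    base_cuboid[(year, quarter, branch)] += sales
base_cuboid = dict(base_cuboid)

# Apex cuboid: no dimensions kept, so it holds a single grand total.
apex_cuboid = {(): sum(base_cuboid.values())}

print(len(base_cuboid))  # 3
print(apex_cuboid)       # {(): 650}
```

The base cuboid keeps one cell per distinct combination of dimension values, while the apex cuboid collapses everything into one cell.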

Data cubes created for different levels of abstraction are called cuboids, so a data cube can be viewed as a lattice of cuboids. Each higher level of abstraction further reduces the size of the resulting data. When answering data mining requests, the smallest available cuboid relevant to the given task should be used.
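The lattice and the "smallest available cuboid" rule can be sketched as follows. This is a simplified illustration, not how any specific OLAP engine works: the rows, the dimension names, the helper functions, and the set of materialized cuboids are all assumptions.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("year", "quarter", "branch")

# Hypothetical fact rows; all values are invented for illustration.
rows = [
    {"year": 2010, "quarter": "Q1", "branch": "Branch A", "sales": 100},
    {"year": 2010, "quarter": "Q2", "branch": "Branch A", "sales": 150},
    {"year": 2010, "quarter": "Q1", "branch": "Branch B", "sales": 200},
    {"year": 2011, "quarter": "Q1", "branch": "Branch A", "sales": 300},
    {"year": 2011, "quarter": "Q2", "branch": "Branch B", "sales": 250},
]


def build_cuboid(rows, dims):
    """Group the rows by the given dimensions and sum the sales."""
    cuboid = defaultdict(int)
    for row in rows:
        cuboid[tuple(row[d] for d in dims)] += row["sales"]
    return dict(cuboid)


def build_lattice(rows, dimensions):
    """Build one cuboid per subset of dimensions, from base to apex."""
    lattice = {}
    for k in range(len(dimensions), -1, -1):
        for dims in combinations(dimensions, k):
            lattice[dims] = build_cuboid(rows, dims)
    return lattice


def smallest_cuboid_for(available, needed_dims):
    """Pick the available cuboid with the fewest cells that covers needed_dims."""
    candidates = [d for d in available if set(needed_dims) <= set(d)]
    return min(candidates, key=lambda d: len(available[d]))


lattice = build_lattice(rows, DIMENSIONS)
for dims, cuboid in lattice.items():
    print(dims, len(cuboid))  # cell count per cuboid

# Assume only some cuboids were pre-computed (materialized).
materialized = {
    dims: lattice[dims]
    for dims in [DIMENSIONS, ("year", "quarter"), ("branch",)]
}
best = smallest_cuboid_for(materialized, ("year",))
print(best)  # ('year', 'quarter')
```

With three dimensions the lattice contains eight cuboids. A request for sales per year is answered from the (year, quarter) cuboid, because it is the smallest pre-computed cuboid that still contains the year dimension; the base cuboid would also work but has more cells to scan.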

