Data Warehousing is a technique that is mainly used to collect and manage data from various sources to give the business a meaningful business insight. A data warehouse is specifically designed to support management decisions.
In simple terms, a data warehouse refers to a database that is maintained separately from an organization’s operational databases. Data warehouse systems enables the integration of multiple application systems. They provide data processing by supporting a solid platform of consolidated, historical information for analysis.
Data warehouses generalize and consolidate information in the multidimensional area. The construction of data warehouses includes data cleaning, data integration,and data transformation and can be considered as an essential pre-processing step for data mining. These tools and utilities involve the following functions −
Data cleaning − Data cleaning means cleaning the data by filling in the missing values, smoothing noisy data, identifying and removing outliers, and removing inconsistencies in the data.
Missing Values − Missing values are filled with appropriate values. Following are the approaches to fill the values.
Noisy data − Noise is a random error or variance in a measured variable. Following are the smoothing techniques to handle noise which are as follows −
Binning − These techniques smooth out a sorted data value by advisory its “neighborhood,” namely, the values about the noisy data. The sorted values are assigned into multiple buckets or bins. Because binning methods consult the neighborhood of values, they implement local smoothing.
Regression − Data can be smoothed by fitting the records to a function, including with regression. Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression is a continuation of linear regression, where more than two attributes are included and the data are fit to a multidimensional surface.
Clustering − Clustering helps in identifying the outliers. Similar values are organized into clusters and those values which fall outside the cluster are called outliers.
Data Transformation − In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following −
Smoothing − It can work to remove noise from the data. Such techniques include binning, regression, and clustering.
Aggregation − In aggregation, where summary or aggregation operations are applied to the data.
Generalization − In Generalization, where low-level or “primitive” (raw) data are restored by larger-level concepts through the use of concept hierarchies.
Load − It can sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions
Refresh − It can propagate the updates from the data sources to the data warehouse.