Data Warehousing - Architecture
In this article, we discuss the business analysis framework for data warehouse design and the architecture of a data warehouse.
Business Analysis Framework
Business analysts get information from the data warehouse to measure performance and make critical adjustments in order to win over other business stakeholders in the market. Having a data warehouse offers the following advantages for the business.
Since a data warehouse can gather information quickly and efficiently, it can enhance business productivity.
A data warehouse provides a consistent view of customers and items, and hence helps us manage customer relationships.
A data warehouse also helps bring down costs by tracking trends and patterns over long periods in a consistent and reliable manner.
To design an effective and efficient data warehouse, we need to understand and analyze the business needs and construct a business analysis framework. Each person has a different view of the design of a data warehouse. These views are as follows:
The top-down view - This view allows the selection of the relevant information needed for the data warehouse.
The data source view - This view presents the information being captured, stored, and managed by operational systems.
The data warehouse view - This view includes the fact tables and dimension tables. It represents the information stored inside the data warehouse.
The business query view - This is the view of the data from the viewpoint of the end user.
Three-Tier Data Warehouse Architecture
Generally, data warehouses adopt a three-tier architecture. Following are the three tiers of the data warehouse architecture.
Bottom Tier - The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use back-end tools and utilities to feed data into the bottom tier. These back-end tools and utilities perform the Extract, Clean, Load, and Refresh functions.
Middle Tier - In the middle tier, we have the OLAP server. The OLAP server can be implemented in either of the following ways:
By Relational OLAP (ROLAP), which is an extended relational database management system. ROLAP maps operations on multidimensional data to standard relational operations.
By Multidimensional OLAP (MOLAP) model, which directly implements multidimensional data and operations.
Top Tier - This tier is the front-end client layer. This layer holds the query tools, reporting tools, analysis tools, and data mining tools.
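To make the ROLAP idea above concrete, here is a minimal sketch of how a ROLAP server might map a multidimensional roll-up request onto a standard relational operation. The table and column names (sales, region, quarter, amount) are assumptions for illustration, not part of any real OLAP API.

```python
# Sketch: translate a multidimensional "roll up along these dimensions"
# request into an ordinary relational GROUP BY statement, as a ROLAP
# server would. All names here are illustrative assumptions.

def rollup_sql(fact_table, dimensions, measure):
    """Map a multidimensional roll-up onto standard SQL."""
    dim_list = ", ".join(dimensions)
    return (f"SELECT {dim_list}, SUM({measure}) AS total "
            f"FROM {fact_table} GROUP BY {dim_list}")

print(rollup_sql("sales", ["region", "quarter"], "amount"))
# SELECT region, quarter, SUM(amount) AS total FROM sales GROUP BY region, quarter
```

A MOLAP server, by contrast, would answer the same request directly from a multidimensional array rather than generating SQL.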
The following diagram explains the three-tier architecture of a data warehouse:
Data Warehouse Models
From the perspective of data warehouse architecture, we have the following data warehouse models:
Virtual Warehouse
Data Mart
Enterprise Warehouse
Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse.
Building a virtual warehouse requires excess capacity on operational database servers.
Data Mart
A data mart contains a subset of organisation-wide data.
This subset of data is valuable to a specific group of an organisation.
Note: In other words, we can say that a data mart contains only the data that is specific to a particular group. For example, the marketing data mart may contain only data related to items, customers, and sales. Data marts are confined to subjects.
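The marketing data mart example can be sketched as carving a group-specific subset out of the organisation-wide data. The sample records and column names below are invented purely for illustration.

```python
# Toy illustration of building a marketing data mart: keep only the
# item, customer, and sales columns from organisation-wide rows.
# All records and field names are invented for this example.

warehouse_rows = [
    {"item": "Pen", "customer": "Acme", "sales": 120, "payroll_id": 7},
    {"item": "Ink", "customer": "Beta", "sales": 80,  "payroll_id": 9},
]

MART_COLUMNS = ("item", "customer", "sales")  # subjects the mart is confined to

marketing_mart = [{k: row[k] for k in MART_COLUMNS} for row in warehouse_rows]
print(marketing_mart[0])  # {'item': 'Pen', 'customer': 'Acme', 'sales': 120}
```

Columns irrelevant to the marketing group (here, the hypothetical payroll_id) never enter the mart.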
Points to remember about data marts
Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.
The implementation cycle of a data mart is measured in short periods of time, i.e. in weeks rather than months or years.
The life cycle of a data mart may be complex in the long run if its planning and design are not organisation-wide.
Data marts are small in size.
Data marts are customized by department.
The source of a data mart is a departmentally structured data warehouse.
Data marts are flexible.
Enterprise Warehouse
An enterprise warehouse collects all the information on all the subjects spanning the entire organization.
It provides us enterprise-wide data integration.
The data is integrated from operational systems and external information providers.
This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.
Load Manager
This component performs the operations required for the extract and load process.
The size and complexity of the load manager varies between specific solutions, from one data warehouse to another.
Load Manager Architecture
The load manager performs the following functions:
Extract the data from the source system.
Fast load the extracted data into a temporary data store.
Perform simple transformations into a structure similar to the one in the data warehouse.
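The three load-manager functions above can be sketched as a small pipeline. The record layout and field names are assumptions made for this illustration, not a prescribed format.

```python
# Hedged sketch of the load manager's three steps: extract from the
# source system, fast-load into a temporary store with no checks, then
# apply simple transformations towards the warehouse structure.
# The record layout below is an illustrative assumption.

def extract(source):
    # Step 1: pull raw rows from the source system.
    return list(source)

def fast_load(rows):
    # Step 2: append-only load into a temporary data store; no
    # transformations or checks yet, to keep the load window small.
    temp_store = []
    temp_store.extend(rows)
    return temp_store

def simple_transform(row):
    # Step 3: reshape a row into a structure similar to the warehouse's.
    return {"item": row["product"].upper(), "qty": int(row["qty"])}

source = [{"product": "pen", "qty": "3"}, {"product": "ink", "qty": "5"}]
staged = fast_load(extract(source))
warehouse_ready = [simple_transform(r) for r in staged]
print(warehouse_ready[0])  # {'item': 'PEN', 'qty': 3}
```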
Extract Data from Source
The data is extracted from the operational databases or from external information providers. Gateways are the application programs used to extract data. A gateway is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.
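A gateway's role, client-generated SQL executed at a server, can be illustrated with Python's DB-API, which plays the same role as ODBC/JDBC. Here sqlite3 stands in for the operational database purely for the sake of a self-contained example.

```python
# Illustration of the gateway idea: the client program generates SQL,
# and the (stand-in) database server executes it. sqlite3 is used here
# only as a self-contained substitute for a real operational DBMS.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 4.0)])

# Client-generated SQL, executed at the "server", pulls the rows out.
extracted = conn.execute("SELECT id, amount FROM orders").fetchall()
print(extracted)  # [(1, 9.5), (2, 4.0)]
```

With a real ODBC or JDBC gateway, only the connection setup would differ; the extract query itself is ordinary SQL.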
Fast Load
In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time.
Transformations affect the speed of data processing.
It is more effective to load the data into a relational database prior to applying transformations and checks.
Gateway technology proves not to be suitable, since gateways tend not to perform well when large data volumes are involved.
Simple Transformations
While loading, it may be required to perform simple transformations. Only after these have been completed are we in a position to do the complex checks. Suppose we are loading EPOS sales transactions; we need to perform the following checks:
Strip out all the columns that are not required within the warehouse.
Convert all values to the required data types.
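The two checks above can be sketched on a single EPOS record. The field names and target types are assumptions for the example.

```python
# Sketch of the two simple transformations on an EPOS sales record:
# strip columns the warehouse does not need, and convert each kept
# value to its required data type. Field names are assumptions.

REQUIRED = {"txn_id": int, "amount": float, "store": str}

def clean(raw):
    # Keep only the required columns, casting each to its target type.
    return {col: cast(raw[col]) for col, cast in REQUIRED.items()}

raw_epos = {"txn_id": "1001", "amount": "12.50", "store": "S1",
            "till_operator": "not needed in the warehouse"}
print(clean(raw_epos))  # {'txn_id': 1001, 'amount': 12.5, 'store': 'S1'}
```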
Warehouse Manager
The warehouse manager is responsible for the warehouse management process.
It consists of third-party system software, C programs, and shell scripts.
The size and complexity of a warehouse manager varies between specific solutions.
Warehouse Manager Architecture
The warehouse manager includes the following:
The Controlling process
Stored procedures or C with SQL
Operations Performed by Warehouse Manager
The warehouse manager analyses the data to perform consistency and referential integrity checks.
It creates the indexes, business views, and partition views against the base data.
It generates new aggregations and updates the existing aggregations. It also generates normalizations.
It transforms and merges the source data from the temporary store into the published data warehouse.
It backs up the data in the data warehouse.
It archives the data that has reached the end of its captured life.
Note: The warehouse manager also analyses query profiles to determine whether indexes and aggregations are appropriate.
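The "generate and update aggregations" operation above can be sketched in a few lines. The base data and item names are invented for the example.

```python
# Sketch of the warehouse manager generating an aggregation from base
# data (total quantity per item) and then updating it incrementally as
# new base data arrives. Data and names are illustrative assumptions.
from collections import defaultdict

base_rows = [("pen", 3), ("ink", 5), ("pen", 2)]

def generate_aggregation(rows):
    # Build a per-item total from the base detail rows.
    totals = defaultdict(int)
    for item, qty in rows:
        totals[item] += qty
    return dict(totals)

aggregate = generate_aggregation(base_rows)
print(aggregate)  # {'pen': 5, 'ink': 5}

# Updating the existing aggregation when new base rows are loaded:
for item, qty in [("pen", 1)]:
    aggregate[item] = aggregate.get(item, 0) + qty
print(aggregate["pen"])  # 6
```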
Query Manager
The query manager is responsible for directing the queries to the suitable tables.
By directing the queries to the appropriate tables, the query request and response process is sped up.
The query manager is also responsible for scheduling the execution of the queries posed by the user.
Query Manager Architecture
Query Manager includes the following:
The query redirection via C tool or RDBMS.
Query Management tool.
Query Scheduling via C tool or RDBMS.
Query Scheduling via third party Software.
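Query redirection, the first responsibility listed for the query manager, can be sketched as a lookup: if a query can be answered from a summary table, it is routed there instead of to the base fact table. All table names below are assumptions for the example.

```python
# Sketch of query redirection: route a query to a matching summary
# (aggregate) table when one exists, otherwise to the base fact table.
# Table names are illustrative assumptions, not a real schema.

SUMMARY_TABLES = {
    ("region",): "sales_by_region",
    ("region", "quarter"): "sales_by_region_quarter",
}

def redirect(group_by_columns, base_table="sales_fact"):
    # A query grouped exactly on a summarised dimension set can be
    # answered from the smaller aggregate table, speeding up response.
    return SUMMARY_TABLES.get(tuple(group_by_columns), base_table)

print(redirect(["region"]))           # sales_by_region
print(redirect(["region", "store"]))  # sales_fact
```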
Detailed Information
The following diagram shows the detailed information:
Detailed information is not kept online; rather, it is aggregated to the next level of detail and then archived to tape. The detailed information part of the data warehouse keeps the detailed information in the starflake schema. Detailed information is loaded into the data warehouse to supplement the aggregated data.
Note: If the detailed information is held offline to minimize disk storage, we should make sure that the data has been extracted, cleaned up, and transformed into the starflake schema before it is archived.
Summary Information
This area of the data warehouse keeps the predefined aggregations.
These aggregations are generated by the warehouse manager.
This area changes on an ongoing basis in order to respond to changing query profiles.
This area of the data warehouse must be treated as transient.
Points to remember about summary information.
Summary data speeds up the performance of common queries.
It increases the operational cost.
It needs to be updated whenever new data is loaded into the data warehouse.
It may not have been backed up, since it can be generated fresh from the detailed information.
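The last point, that summary information can be regenerated rather than backed up, can be sketched as rebuilding a summary table from the detail rows after a load. Names and data are illustrative assumptions.

```python
# Sketch: because summary data is derived, it can always be rebuilt
# fresh from the detailed information, so it need not be backed up.
# The detail rows and month keys are invented for this example.

detail = [("2024-01", 100), ("2024-01", 50), ("2024-02", 75)]

def rebuild_summary(rows):
    # Recompute the monthly totals from scratch out of the detail data.
    summary = {}
    for month, amount in rows:
        summary[month] = summary.get(month, 0) + amount
    return summary

print(rebuild_summary(detail))  # {'2024-01': 150, '2024-02': 75}
```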