What are the criteria for selecting the data sources?


The main criteria for selecting data sources are as follows −

Data accessibility − Suppose two possible feeds exist for the data. One is stored in binary files maintained by a set of programs written before the youngest project team member was born. The other comes from a system that already reads those binary files and performs additional processing. The decision is then obvious: use the second, more accessible feed.

Data accuracy − As data is passed from system to system, many modifications are made. Sometimes data elements from other systems are added, and sometimes existing elements are processed to create new elements and other elements are dropped.

Each system performs its own function well. However, after several hand-offs it may become difficult or impossible to recognize the original data, and in some cases the data no longer represents what the business wants for analysis. If you provide the data from these downstream systems, the users may question its accuracy, so it may be better to go back to the original source.

Project scheduling − In many organizations, the data warehouse project begins as part of a rewrite of an existing OLTP system. As the new system development project unfolds, business users who are now firmly convinced of the value of a data warehouse often begin to insist that it be implemented sooner rather than later.

To provide historical data, you need to include the data from the existing system in your data warehouse. If the rewrite of the old system is held up, the data warehouse can continue to use the current system. Once the new system is released for production, the data feeds can be switched to it. In many cases, it is possible to deliver the data warehouse before the new operational system is completed.
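The feed switch described above can be sketched as a small routing function. This is only an illustration under stated assumptions: the extract functions, the source names `legacy_oltp` and `new_oltp`, and all dates are hypothetical and do not come from any particular system.

```python
from datetime import date

# Hypothetical extract functions; real ones would query the source systems.
def extract_from_legacy(day):
    return {"source": "legacy_oltp", "day": day.isoformat()}

def extract_from_new(day):
    return {"source": "new_oltp", "day": day.isoformat()}

def select_feed(day, cutover):
    """Pick the extract function for a business day.

    cutover is None while the new system is not yet in production.
    """
    if cutover is not None and day >= cutover:
        return extract_from_new
    return extract_from_legacy

# While the rewrite is delayed, every load reads the existing system.
day = date(2022, 2, 9)
before = select_feed(day, cutover=None)(day)

# After release (assumed cutover date), history still comes from the
# old system and new business days come from the new one.
cutover = date(2022, 6, 1)
history = select_feed(day, cutover)(day)
new_day = date(2022, 6, 2)
current = select_feed(new_day, cutover)(new_day)

print(before["source"], history["source"], current["source"])
```

Keeping the choice of feed in one place means the warehouse load does not have to change when the new system finally goes live; only the cutover date does.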

Some dimensional information usually comes with the transaction or fact data, but it is usually minimal and often only in the form of codes. The additional attributes that users want and need are fed from other systems or from separate master files.
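As a minimal sketch of this idea, the example below attaches descriptive attributes from a master file to fact rows that carry only a code. The rows, the product codes, and the attribute names are all made up for illustration, and mapping unmatched codes to an "Unknown" entry is one possible convention, not a rule stated in the text.

```python
# Hypothetical transaction (fact) rows: they carry only a product code.
fact_rows = [
    {"order_id": 1, "product_code": "P10", "amount": 250.0},
    {"order_id": 2, "product_code": "P20", "amount": 99.5},
    {"order_id": 3, "product_code": "P99", "amount": 10.0},
]

# Hypothetical product master file, keyed by the same code.
product_master = {
    "P10": {"product_name": "Laptop stand", "category": "Accessories"},
    "P20": {"product_name": "USB hub", "category": "Accessories"},
}

# Assumed placeholder for codes missing from the master file.
UNKNOWN = {"product_name": "Unknown", "category": "Unknown"}

def enrich(rows, master):
    """Attach master-file attributes to each fact row by looking up its code."""
    enriched = []
    for row in rows:
        attributes = master.get(row["product_code"], UNKNOWN)
        enriched.append({**row, **attributes})
    return enriched

enriched_rows = enrich(fact_rows, product_master)
for row in enriched_rows:
    print(row)
```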

In many instances, there are multiple master files, especially for the customer dimension. Separate files are often maintained by different parts of an organization. Sales, Marketing, and Finance may each have their own customer master file.

This raises two difficult issues. First, the customers included in these files may differ, and the attributes kept about each customer may differ. Second, the information the files have in common may not match. With unlimited time and money, you could pull rich data from all sources and combine it into a single comprehensive view of customers.

In most cases, there is not enough time or money to do that all at once. It is then recommended that the users prioritize the information, and that you start with what you can deliver and expand in the future.
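One way to picture combining several customer master files is a priority-based merge, sketched below. Everything here is an assumption for illustration: the three sample files, the customer ids, the attribute names, and the ranking of Finance over Sales over Marketing, which in practice the business users would decide.

```python
# Hypothetical customer master files, keyed by customer id.
sales = {
    "C1": {"name": "Acme Corp", "region": "East"},
    "C2": {"name": "Globex", "region": None},
}
marketing = {
    "C1": {"name": "ACME Corporation", "segment": "Enterprise"},
    "C3": {"name": "Initech", "segment": "SMB"},
}
finance = {
    "C2": {"name": "Globex Inc.", "credit_limit": 50000},
}

# Sources listed from highest to lowest priority (an assumed ranking).
sources_by_priority = [finance, sales, marketing]

def merge_customers(sources):
    """Build one record per customer, preferring higher-priority sources."""
    merged = {}
    for records in sources:
        for customer_id, attributes in records.items():
            target = merged.setdefault(customer_id, {})
            for key, value in attributes.items():
                # Keep the value already set by a higher-priority source;
                # lower-priority sources only fill gaps.
                if value is not None and target.get(key) is None:
                    target[key] = value
    return merged

customers = merge_customers(sources_by_priority)
for customer_id in sorted(customers):
    print(customer_id, customers[customer_id])
```

The sample data shows both issues from the text: the files cover different customers (C3 appears only in Marketing), and shared attributes disagree (two spellings of the same customer name), which the priority order resolves.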

raja
Updated on 09-Feb-2022 13:17:17
