Data integration is the process of combining data from several disparate sources. During integration, issues such as data redundancy, inconsistency, and duplication must be handled. In data mining, data integration is a data pre-processing technique that merges data from numerous heterogeneous data sources into a coherent store, providing a consolidated view of the information.

It combines data from various sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. There are several issues to consider during data integration.

Schema integration and object matching can be complex. For example, entity identification must recognize that differently named attributes refer to the same entity (emp_id in one database and emp_no in another); such issues can be resolved using metadata.
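The metadata-driven matching described above can be sketched as follows. Only emp_id and emp_no come from the text; the canonical name and the other fields are hypothetical, and a real system would draw the mapping from a metadata repository rather than a hard-coded dictionary.

```python
# Metadata mapping: records that emp_id (database A) and emp_no
# (database B) denote the same attribute, under one canonical name.
# "employee_id", "salary", and "dept" are hypothetical examples.
canonical_map = {
    "emp_id": "employee_id",   # name used in database A
    "emp_no": "employee_id",   # name used in database B
}

def normalize(record):
    """Rename each attribute to its canonical name, if one is known."""
    return {canonical_map.get(k, k): v for k, v in record.items()}

db_a = {"emp_id": 101, "salary": 52000}
db_b = {"emp_no": 101, "dept": "R&D"}

# After normalization, both records describe the same entity under the
# same attribute names and can be merged.
merged = {**normalize(db_a), **normalize(db_b)}
print(merged)  # {'employee_id': 101, 'salary': 52000, 'dept': 'R&D'}
```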

Redundancy is another issue. An attribute such as annual revenue, for instance, may be redundant if it can be derived from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also create redundancies in the resulting data set.

Some redundancies can be discovered by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data. For numerical attributes, the correlation between two attributes A and B can be evaluated by computing the correlation coefficient (also known as Pearson’s product-moment coefficient, named after its inventor, Karl Pearson):

$$r_{A,B}=\frac{\sum_{i=1}^{n}(a_{i}-A^{'})(b_{i}-B^{'})}{n\sigma_{A}\sigma_{B}}=\frac{\sum_{i=1}^{n}a_{i}b_{i}-nA^{'}B^{'}}{n\sigma_{A}\sigma_{B}}$$

where n is the number of tuples, a_{i} and b_{i} are the respective values of A and B in tuple i, A^{'} and B^{'} are the respective mean values of A and B, σ_{A} and σ_{B} are the respective standard deviations of A and B, and Σ(a_{i}b_{i}) is the sum of the AB cross-product (that is, for each tuple, the value of A is multiplied by the value of B in that tuple).
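The formula can be computed directly from its definition. The sketch below uses the cross-product form with population standard deviations, matching the symbols above; the sample data is illustrative only.

```python
import math

def pearson_r(a, b):
    """Pearson's product-moment correlation coefficient:
    r = (sum(a_i * b_i) - n * mean_a * mean_b) / (n * sd_a * sd_b),
    using population standard deviations."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((x - mean_b) ** 2 for x in b) / n)
    cross = sum(x * y for x, y in zip(a, b))  # sum of the AB cross-product
    return (cross - n * mean_a * mean_b) / (n * sd_a * sd_b)

# B is exactly twice A, so the attributes are perfectly correlated
# and B is redundant given A.
a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]
print(round(pearson_r(a, b), 4))  # 1.0
```

A value of r near +1 or −1 flags one attribute as derivable from the other, which is how correlation analysis exposes redundancy during integration.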

Correlation does not imply causality. That is, if A and B are correlated, it does not necessarily follow that A causes B or that B causes A. For example, analysis of a demographic database may find that the number of hospitals and the number of car thefts in a region are correlated. This does not mean that one causes the other; both are typically linked to a third attribute, such as population.

A third important issue in data integration is the detection and resolution of data value conflicts. For the same real-world entity, attribute values from different sources may differ because of differences in representation, scaling, or encoding.
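A scaling conflict of the kind just described can be resolved by converting all sources to a common unit before comparison. The sketch below is hypothetical: the weight attribute, entity names, and values are invented for illustration, but the pound-to-kilogram factor is standard.

```python
# Hypothetical scenario: source A stores weight in kilograms,
# source B stores the same entity's weight in pounds. Converting
# both to a common unit reveals the values actually agree.
LB_TO_KG = 0.45359237  # exact definition of the avoirdupois pound

def to_kg(value, unit):
    """Convert a weight to kilograms; supported units are 'kg' and 'lb'."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value * LB_TO_KG
    raise ValueError(f"unknown unit: {unit}")

source_a = {"entity": "shipment-1", "weight": 10.0, "unit": "kg"}
source_b = {"entity": "shipment-1", "weight": 22.05, "unit": "lb"}

wa = to_kg(source_a["weight"], source_a["unit"])
wb = to_kg(source_b["weight"], source_b["unit"])

# The apparent conflict was only a difference in scaling.
print(abs(wa - wb) < 0.01)  # True
```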
