What is Data Integration?

Data integration is the process of combining data from several disparate sources. A sound integration has to deal with problems such as redundancy, inconsistency, and duplication. In data mining, data integration is a pre-processing technique that merges data from multiple heterogeneous sources into a coherent data set, giving a consolidated view of the information.

The result is a coherent data store, as in data warehousing. The sources can include multiple databases, data cubes, or flat files. There are several issues to consider during data integration.

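  • Schema integration and object matching can be complex. This is the entity identification problem: for example, how can we be sure that emp_id in one database and emp_no in another refer to the same attribute? Metadata describing each attribute (name, meaning, data type, permitted values) can help avoid such errors.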

  • Redundancy is another issue. An attribute such as annual revenue, for instance, is redundant if it can be derived from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.

  • Some redundancies can be detected by correlation analysis. Given two attributes, such analysis measures how strongly one attribute implies the other, based on the available data. For numerical attributes, we can evaluate the correlation between two attributes, A and B, by computing the correlation coefficient (also known as Pearson’s product-moment coefficient, named after its inventor, Karl Pearson). This is

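$$r_{A,B}=\frac{\sum_{i=1}^{N}(a_{i}-\bar{A})(b_{i}-\bar{B})}{N\sigma_{A}\sigma_{B}}=\frac{\sum_{i=1}^{N}(a_{i}b_{i})-N\bar{A}\bar{B}}{N\sigma_{A}\sigma_{B}}$$

where $N$ is the number of tuples, $a_i$ and $b_i$ are the respective values of A and B in tuple $i$, $\bar{A}$ and $\bar{B}$ are the respective mean values of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum(a_i b_i)$ is the sum of the AB cross-product (that is, for each tuple, the value for A is multiplied by the value for B in that tuple). The coefficient satisfies $-1 \le r_{A,B} \le 1$. A value well above 0 means A and B are positively correlated, and the higher the value, the more likely it is that one of the two attributes is redundant. A value of 0 means there is no linear correlation between them, and a value below 0 means they are negatively correlated.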
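The formula above can be sketched in a few lines of Python. This is only an illustration, assuming population standard deviations (dividing by N, as in the formula). The function name `pearson_r` and the revenue figures are hypothetical and not part of the original text.

```python
import math


def pearson_r(a, b):
    """Pearson correlation of two equal-length numeric sequences.

    Uses population standard deviations (divide by N), matching the formula.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("a and b must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sigma_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sigma_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    if sigma_a == 0 or sigma_b == 0:
        # A constant attribute has zero spread, so r is undefined.
        raise ValueError("correlation is undefined for a constant attribute")
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / (n * sigma_a * sigma_b)


# Hypothetical data: annual revenue is derived directly from monthly revenue,
# so the two attributes should be (almost exactly) perfectly correlated.
monthly_revenue = [10.0, 20.0, 30.0, 40.0]
annual_revenue = [12 * m for m in monthly_revenue]
r = pearson_r(monthly_revenue, annual_revenue)
print(f"r = {r:.4f}")
```

A coefficient very close to 1 here suggests that one of the two attributes could be dropped as redundant. In practice the threshold for "close enough" is a judgment call that depends on the data.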

Correlation does not imply causality. That is, if A and B are correlated, this does not necessarily mean that A causes B or that B causes A. For example, in analyzing a demographic database, we may find that the attributes representing the number of hospitals and the number of car thefts in a region are correlated. This does not mean that one causes the other. Both are actually linked to a third attribute, namely population.

A third important issue in data integration is the detection and resolution of data value conflicts. For the same real-world entity, attribute values from different sources can differ. This may be due to differences in representation, scaling, or encoding. For example, a weight attribute may be stored in metric units in one system and in imperial units in another.
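The schema-matching and value-conflict issues described above can be sketched together. The following is a minimal illustration, not a definitive implementation. The source names (`hr_db`, `payroll_db`), the `COLUMN_MAP` metadata, the weight columns, and the helpers `normalize` and `integrate` are all hypothetical. The sketch assumes that metadata is available to map each source's column names onto a shared schema, and that rounding to two decimals is an acceptable tolerance when comparing converted values.

```python
# Hypothetical metadata: maps each source's column names to a shared schema.
COLUMN_MAP = {
    "hr_db": {"emp_id": "employee_id"},
    "payroll_db": {"emp_no": "employee_id"},
}

LB_TO_KG = 0.45359237  # exact definition of the international pound


def normalize(source, record):
    """Rename columns via metadata and convert weights to kilograms."""
    mapping = COLUMN_MAP[source]
    row = {mapping.get(name, name): value for name, value in record.items()}
    if "weight_lb" in row:
        # Resolve the scaling difference: pounds -> kilograms.
        row["weight_kg"] = round(row.pop("weight_lb") * LB_TO_KG, 2)
    return row


def integrate(records_by_source):
    """Merge records by employee_id and report conflicting values."""
    merged = {}
    conflicts = []
    for source, records in records_by_source.items():
        for record in records:
            row = normalize(source, record)
            existing = merged.setdefault(row["employee_id"], {})
            for field, value in row.items():
                if field in existing and existing[field] != value:
                    # Same entity, same attribute, different value.
                    conflicts.append(
                        (row["employee_id"], field, existing[field], value)
                    )
                else:
                    existing[field] = value
    return merged, conflicts


# Hypothetical input: the same employee appears in both sources.
sources = {
    "hr_db": [{"emp_id": 7, "weight_kg": 70.0}],
    "payroll_db": [{"emp_no": 7, "weight_lb": 154.32}],
}
merged, conflicts = integrate(sources)
print(merged)
print(conflicts)
```

Here the two records describe the same employee once the column names and units are reconciled, so no conflict is reported. If the converted values still disagreed, the sketch would only record the conflict. Deciding which source to trust is left to the integration policy.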