What are interval-scaled variables?


Interval-scaled variables are continuous measurements on a roughly linear scale. Typical examples include weight and height, latitude and longitude coordinates (e.g., when clustering houses), and weather temperature. The measurement unit used can affect the clustering analysis.

For instance, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, can lead to a very different clustering structure. In general, expressing a variable in smaller units leads to a larger range for that variable, and therefore a larger effect on the resulting clustering structure.
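The effect of units can be illustrated with a minimal sketch. The three (height, weight) records below are hypothetical values chosen for illustration, and `nearest_to` is a helper introduced here, not something from a library. Converting height from meters to inches changes which record is closest to A.

```python
import math

INCHES_PER_METER = 39.3701

# Hypothetical (height in meters, weight in kilograms) records.
people_m = {"A": (1.60, 60.0), "B": (1.90, 61.0), "C": (1.62, 64.0)}

def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_to(name, data):
    """Return the key of the record closest to data[name]."""
    others = (other for other in data if other != name)
    return min(others, key=lambda other: euclidean(data[name], data[other]))

# Same records, with height expressed in inches instead of meters.
people_in = {k: (h * INCHES_PER_METER, w) for k, (h, w) in people_m.items()}

nearest_m = nearest_to("A", people_m)    # weight dominates the distance
nearest_in = nearest_to("A", people_in)  # height now dominates
print(nearest_m, nearest_in)  # prints: B C
```

With height in meters the weight difference dominates, so B is the nearest neighbor of A; with height in inches the height difference dominates, so C becomes the nearest neighbor.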

To help avoid dependence on the choice of measurement units, the data should be standardized. Standardizing measurements attempts to give all variables an equal weight. This is especially useful when there is no prior knowledge of the data. In some applications, however, users may intentionally want to give more weight to a certain set of variables than to others. For instance, when clustering basketball player candidates, one may prefer to give more weight to the variable height.

To standardize measurements, one choice is to convert the original measurements to unitless variables. Given measurements for a variable f, this can be done as follows −

Calculate the mean absolute deviation, sf

$$\mathrm{s_{f}\:=\:\frac{1}{n}(|x_{1f}-m_{f}|+|x_{2f}-m_{f}|+\cdot\cdot\cdot+|x_{nf}-m_{f}|)}$$

where $\mathrm{x_{1f},\:\ldots,\:x_{nf}}$ are n measurements of f, and $\mathrm{m_{f}}$ is the mean value of f, that is, $\mathrm{m_{f}\:=\:\frac{1}{n}(x_{1f}+x_{2f}+\cdot\cdot\cdot+x_{nf})}$

Calculate the standardized measurement, or z-score −

$$\mathrm{z_{if}\:=\:\frac{x_{if}-m_{f}}{s_{f}}}$$
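The two steps can be sketched in plain Python as follows. The function name `standardize` and the sample heights are assumptions made for illustration; the zero-deviation guard is an added safety check, not part of the procedure described above.

```python
def standardize(values):
    """Return z-scores computed with the mean absolute deviation s_f."""
    n = len(values)
    m_f = sum(values) / n                         # mean value of f
    s_f = sum(abs(x - m_f) for x in values) / n   # mean absolute deviation
    if s_f == 0:
        # All measurements are identical, so the z-score is undefined.
        raise ValueError("mean absolute deviation is zero")
    return [(x - m_f) / s_f for x in values]

# Hypothetical heights in centimeters.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
z = standardize(heights_cm)

# The same heights in inches give the same z-scores (up to rounding),
# because the unit cancels in (x - m_f) / s_f.
z_inches = standardize([h / 2.54 for h in heights_cm])
print(z)
```

Here the mean is 170 and the mean absolute deviation is 12, so the z-scores are the deviations from 170 divided by 12.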

The mean absolute deviation, sf, is more robust to outliers than the standard deviation, $\mathrm{\sigma_{f}}$. When computing the mean absolute deviation, the deviations from the mean $\mathrm{(|x_{if}-m_{f}|)}$ are not squared.

Therefore, the effect of outliers is somewhat reduced. There are more robust measures of dispersion, such as the median absolute deviation. The benefit of using the mean absolute deviation is that the z-scores of outliers do not become too small; therefore, the outliers remain detectable.
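As a rough illustration of this point, the sketch below compares the z-score of one outlier when the denominator is the mean absolute deviation versus the (population) standard deviation. The data values are made up for the example.

```python
import statistics

# Hypothetical measurements with one obvious outlier (100).
values = [10.0, 11.0, 12.0, 13.0, 14.0, 100.0]
n = len(values)
mean = sum(values) / n

# Mean absolute deviation: deviations are not squared.
mean_abs_dev = sum(abs(x - mean) for x in values) / n
# Population standard deviation: deviations are squared, so the
# outlier inflates it more.
std_dev = statistics.pstdev(values)

outlier = values[-1]
z_mad = (outlier - mean) / mean_abs_dev
z_std = (outlier - mean) / std_dev
print(round(z_mad, 2), round(z_std, 2))
```

For this data the outlier's z-score is about 3.0 with the mean absolute deviation and about 2.23 with the standard deviation, so the outlier stands out more clearly in the first case.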

Standardization may or may not be useful in a particular application. The choice of whether and how to standardize should therefore be left to the user. After standardization, or without standardization in certain applications, the dissimilarity (or similarity) between objects described by interval-scaled variables is typically computed based on the distance between each pair of objects.

The most popular distance measure is Euclidean distance, which is defined as

$$\mathrm{d(i, j)\:=\:\sqrt{(x_{i1}-x_{j1})^2+(x_{i2}-x_{j2})^2+\cdot\cdot\cdot+(x_{in}-x_{jn})^2}}$$

where i = (xi1, xi2, …, xin) and j = (xj1, xj2, …, xjn) are two n-dimensional data objects. Another well-known metric is Manhattan (or city block) distance, defined as

$$\mathrm{d(i, j)\:=\:|x_{i1}-x_{j1}|+|x_{i2}-x_{j2}|+\cdot\cdot\cdot+|x_{in}-x_{jn}|}$$
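Both distances translate directly into code. The following is a minimal sketch using only the standard library; the function names and the two sample objects are chosen here for illustration.

```python
import math

def euclidean(i, j):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

def manhattan(i, j):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(i, j))

# Two hypothetical 2-dimensional data objects.
obj_i = (1.0, 2.0)
obj_j = (4.0, 6.0)

d_euclid = euclidean(obj_i, obj_j)  # sqrt(3^2 + 4^2)
d_manhat = manhattan(obj_i, obj_j)  # |3| + |4|
print(d_euclid, d_manhat)  # prints: 5.0 7.0
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of objects, since it sums the coordinate differences instead of taking the straight-line path.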

Both the Euclidean distance and the Manhattan distance satisfy the following mathematical requirements of a distance function −

  • d(i, j) ≥ 0: Distance is a nonnegative number.

  • d(i, i) = 0: The distance of an object to itself is 0.

  • d(i, j) = d(j, i): Distance is a symmetric function.

  • d(i, j) ≤ d(i, h) + d(h, j): Going directly from object i to object j in space is no more than making a detour over any other object h (triangle inequality).
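The four requirements above can be spot-checked numerically. The sketch below tests them on a small set of randomly generated points; it is an empirical check under the assumption of a small floating-point tolerance, not a proof, and the helper name `satisfies_distance_properties` is introduced here for illustration.

```python
import itertools
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def satisfies_distance_properties(d, points, tol=1e-9):
    """Check the four properties for every triple (i, j, h) of points."""
    for i, j, h in itertools.product(points, repeat=3):
        if d(i, j) < 0:                          # non-negativity
            return False
        if abs(d(i, i)) > tol:                   # distance to itself is 0
            return False
        if abs(d(i, j) - d(j, i)) > tol:         # symmetry
            return False
        if d(i, j) > d(i, h) + d(h, j) + tol:    # triangle inequality
            return False
    return True

# Fifteen random 3-dimensional points; the seed only makes the run repeatable.
rng = random.Random(0)
points = [tuple(rng.uniform(-10, 10) for _ in range(3)) for _ in range(15)]

euclid_ok = satisfies_distance_properties(euclidean, points)
manhat_ok = satisfies_distance_properties(manhattan, points)
print(euclid_ok, manhat_ok)  # prints: True True
```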

raja
Updated on 16-Feb-2022 12:01:16
