What are Data Characteristics?


The following characteristics of data can strongly affect cluster analysis −

High Dimensionality − In high-dimensional data sets, the traditional Euclidean notion of density, which is the number of points per unit volume, becomes meaningless. As the number of dimensions increases, the volume increases rapidly, and unless the number of points grows exponentially with the number of dimensions, the density tends to 0.

Proximity also tends to become more uniform in high-dimensional spaces. Another way to see this is that more dimensions (attributes) contribute to the proximity between two points, which tends to make the proximity between any pair of points more alike.

Because most clustering techniques depend on proximity or density, they can have difficulty with high-dimensional data. One way to address such issues is to apply dimensionality reduction techniques.
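
The loss of contrast between distances can be illustrated with a small experiment. The following is a minimal sketch, assuming points drawn uniformly from the unit hypercube; the function name relative_contrast and the chosen sample sizes are hypothetical and are not part of any library.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """Return (max - min) / min over distances from one point to all others."""
    rng = np.random.default_rng(seed)
    # Assumption: points are uniform in the unit hypercube [0, 1]^n_dims.
    points = rng.random((n_points, n_dims))
    # Euclidean distances from the first point to every other point.
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return (dists.max() - dists.min()) / dists.min()

if __name__ == "__main__":
    for n_dims in (2, 10, 100, 1000):
        print(n_dims, round(relative_contrast(500, n_dims), 3))
```

Under these assumptions the printed contrast shrinks as the number of dimensions grows, which means the nearest and farthest neighbors of a point become hard to tell apart.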

Size − Many clustering algorithms that work well for small or medium-size data sets are unable to handle larger data sets.

Sparseness − Sparse data often consists of asymmetric attributes, where zero values are not as important as non-zero values. Hence, similarity measures suitable for asymmetric attributes are generally used.
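
To see why the choice of measure matters for sparse data, the sketch below compares the simple matching coefficient, which counts shared zeros as agreement, with the Jaccard coefficient, which ignores them. The two vectors are made-up illustrative data, and returning 0.0 when both vectors are all zeros is an assumed convention.

```python
import numpy as np

def simple_matching(a, b):
    """Fraction of positions where the two binary vectors agree (0-0 and 1-1)."""
    return float(np.mean(a == b))

def jaccard(a, b):
    """Shared 1s divided by positions where at least one vector has a 1."""
    both = np.sum((a == 1) & (b == 1))
    either = np.sum((a == 1) | (b == 1))
    # Assumed convention: two all-zero vectors get similarity 0.0.
    return float(both / either) if either else 0.0

# Two sparse binary vectors with 1000 attributes and three 1s each,
# sharing only a single non-zero position (index 2).
a = np.zeros(1000, dtype=int)
b = np.zeros(1000, dtype=int)
a[[0, 1, 2]] = 1
b[[2, 3, 4]] = 1

if __name__ == "__main__":
    print("simple matching:", simple_matching(a, b))
    print("jaccard:", jaccard(a, b))
```

Simple matching rates the two objects as almost identical because of the many shared zeros, while Jaccard reflects that they have only one non-zero attribute in common.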

Noise and Outliers − An atypical point (outlier) can severely degrade the performance of clustering algorithms, particularly prototype-based algorithms such as K-means. Noise, on the other hand, can cause techniques such as single link to join clusters that should not be joined.

In some cases, algorithms for removing noise and outliers are applied before a clustering algorithm is used. Alternatively, some algorithms can detect points that represent noise and outliers during the clustering process and then delete them or otherwise eliminate their negative effects.
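
The effect of a single outlier on a prototype can be sketched with a mean-based centroid, the kind of prototype K-means uses. The points below are made-up illustrative data. The per-coordinate median is shown only as a contrasting robust summary; it is a different technique and not part of K-means.

```python
import numpy as np

# A tight cluster around (1, 1) plus one distant outlier (illustrative data).
cluster = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.2], [1.0, 1.0]])
outlier = np.array([[21.0, 21.0]])
with_outlier = np.vstack([cluster, outlier])

# Mean-based centroid without and with the outlier.
centroid_clean = cluster.mean(axis=0)
centroid_noisy = with_outlier.mean(axis=0)

# Per-coordinate median of the same noisy data, for comparison.
median_noisy = np.median(with_outlier, axis=0)

if __name__ == "__main__":
    print("centroid without outlier:", centroid_clean)
    print("centroid with outlier:   ", centroid_noisy)
    print("median with outlier:     ", median_noisy)
```

One outlier among five points pulls the mean far away from the cluster, while the median stays inside it.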

Type of Attributes and Data Set − Data sets can be of different types, such as structured, graph, or ordered, while attributes can be categorical (nominal or ordinal) or quantitative (interval or ratio), and can be binary, discrete, or continuous.

Different proximity and density measures are appropriate for different types of data. In some situations, data may need to be discretized or binarized so that a desired proximity measure or clustering algorithm can be used.

Another difficulty arises when attributes are of widely differing types, e.g., continuous and nominal. In such cases, proximity and density are more difficult to define and are often more ad hoc. Finally, special data structures and algorithms may be needed to handle certain types of data efficiently.
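
Discretization and binarization can be sketched as follows. The ages and the cut points in bin_edges are hypothetical values chosen only for illustration; np.digitize assigns each value to an interval, and indexing an identity matrix turns the interval labels into one binary attribute per interval.

```python
import numpy as np

# Hypothetical continuous attribute and hypothetical cut points.
ages = np.array([3, 17, 25, 40, 68])
bin_edges = np.array([18, 35, 60])

# Discretize: 0 means < 18, 1 means 18-34, 2 means 35-59, 3 means >= 60.
age_bins = np.digitize(ages, bin_edges)

# Binarize: one binary attribute per interval (one-hot encoding).
one_hot = np.eye(len(bin_edges) + 1, dtype=int)[age_bins]

if __name__ == "__main__":
    print(age_bins)
    print(one_hot)
```

After this step, a measure designed for binary attributes can be applied to what was originally a continuous attribute.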

Scale − Different attributes, such as height and weight, can be measured on different scales. These differences can strongly affect the distance or similarity between two objects and, consequently, the outcome of a cluster analysis. Consider clustering a group of people based on their heights, which are measured in meters, and their weights, which are measured in kilograms. Because weights vary over a much larger numeric range than heights, weight dominates the distance between two people unless the attributes are first brought to a common scale.
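
The height and weight example can be sketched numerically. The three people below are made-up illustrative data, and z-score standardization is one common way to bring attributes to a comparable scale, used here as an assumption rather than as the only option.

```python
import numpy as np

# Columns: height in meters, weight in kilograms (illustrative data).
people = np.array([
    [1.60, 60.0],  # person 0
    [1.90, 61.0],  # person 1: much taller, similar weight
    [1.61, 75.0],  # person 2: similar height, much heavier
])

def euclidean(a, b):
    """Euclidean distance between two attribute vectors."""
    return float(np.linalg.norm(a - b))

# Distances on the raw values: the weight difference dominates.
raw_01 = euclidean(people[0], people[1])
raw_02 = euclidean(people[0], people[2])

# Z-score standardization: zero mean and unit standard deviation per column.
standardized = (people - people.mean(axis=0)) / people.std(axis=0)
z_01 = euclidean(standardized[0], standardized[1])
z_02 = euclidean(standardized[0], standardized[2])

if __name__ == "__main__":
    print("raw:         ", round(raw_01, 3), round(raw_02, 3))
    print("standardized:", round(z_01, 3), round(z_02, 3))
```

On the raw values, person 0 appears far closer to person 1 than to person 2, even though person 1 is 30 cm taller. After standardization both differences contribute on a comparable scale.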

Updated on: 14-Feb-2022
