What are the requirements of clustering in data mining?

Clustering in data mining has the following typical requirements −

Scalability − Some clustering algorithms work well on small data sets containing fewer than several hundred data objects. A large database, however, can contain millions of objects. Clustering on only a sample of such a data set can lead to biased results. Highly scalable clustering algorithms are therefore required.

Ability to deal with different types of attributes − Some algorithms are designed to cluster interval-based (numerical) data. However, applications can require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or a combination of these data types.
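As an illustration of handling mixed attribute types, the sketch below averages a per-attribute dissimilarity over numeric, nominal, and ordinal attributes. This is one common way to do it, not a method prescribed here; the function name `mixed_dissimilarity`, the `kinds` labels, and the sample records are all hypothetical.

```python
def mixed_dissimilarity(a, b, kinds, scales):
    """Average per-attribute dissimilarity in [0, 1] between records a and b.

    kinds[i]  : "numeric", "nominal" (also usable for binary), or "ordinal"
    scales[i] : value range (max - min) for numeric attributes,
                number of ranks for ordinal attributes (ranks are 1..M),
                ignored for nominal attributes
    """
    total = 0.0
    for x, y, kind, scale in zip(a, b, kinds, scales):
        if kind == "numeric":
            # Normalize the difference by the attribute's value range.
            total += abs(x - y) / scale
        elif kind == "ordinal":
            # Ranks 1..M are mapped onto [0, 1] before comparing.
            total += abs(x - y) / (scale - 1)
        elif kind == "nominal":
            # Simple matching: 0 if equal, 1 otherwise.
            total += 0.0 if x == y else 1.0
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
    return total / len(kinds)


# Hypothetical records: (age, colour, size rank with 1 = small .. 3 = large)
kinds = ("numeric", "nominal", "ordinal")
scales = (40, None, 3)  # ages assumed to span 40 years; 3 size ranks
r1 = (30, "red", 1)
r2 = (50, "blue", 3)
print(mixed_dissimilarity(r1, r2, kinds, scales))
```

Because every attribute contributes a value between 0 and 1, no single attribute type dominates the result.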

Discovery of clusters with arbitrary shape − Some clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters of similar size and density. However, a cluster can be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
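One known family of methods that can find non-spherical clusters is density-based clustering. The following is a minimal DBSCAN-style sketch, given only as an illustration and not as the approach this article prescribes; the parameters `eps` and `min_pts` and the sample points are assumptions chosen for the example.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep growing
    return labels


# Two long, thin clusters (far from spherical) plus one isolated point.
line_a = [(float(x), 0.0) for x in range(10)]
line_b = [(float(x), 5.0) for x in range(10)]
points = line_a + line_b + [(20.0, 20.0)]
print(dbscan(points, eps=1.5, min_pts=3))
```

Each chain of points is grown neighbor by neighbor, so the cluster follows the shape of the data instead of being forced around a single center.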

Minimal requirements for domain knowledge to determine input parameters − Some clustering algorithms require users to supply specific parameters for cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to these input parameters. Parameters are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but also makes the quality of clustering difficult to control.

Ability to deal with noisy data − Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and can produce clusters of poor quality.
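A small sketch of why noise matters: if a cluster is represented by the mean of its members, a single erroneous value can drag that representative far away, while a median-based representative barely moves. The values below are made up for illustration.

```python
from statistics import mean, median

cluster = [10, 11, 12, 13, 14]
noisy_cluster = cluster + [1000]  # one erroneous reading added

# Mean and median agree on the clean data.
print(mean(cluster), median(cluster))

# With the outlier, the mean shifts strongly; the median changes only slightly.
print(mean(noisy_cluster), median(noisy_cluster))
```

This is one reason methods built around medians or medoids are often considered more robust to outliers than methods built around means.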

Incremental clustering and insensitivity to the order of input records − Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates) into existing clustering structures and, instead, must compute a new clustering from scratch.

Some clustering algorithms are also sensitive to the order of input records. Given a set of data objects, such an algorithm can return dramatically different clusterings depending on the order in which the input objects are presented. It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input.
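Order sensitivity can be seen in a simple sequential ("leader") clustering sketch. It is incremental, since each record is processed once as it arrives, but its result depends on arrival order. The function name `leader_clustering`, the threshold, and the data are assumptions made for this example.

```python
def leader_clustering(values, threshold):
    """Assign each value to the first cluster whose leader is within
    `threshold`; otherwise start a new cluster led by that value."""
    leaders = []
    clusters = []
    for v in values:
        for k, leader in enumerate(leaders):
            if abs(v - leader) <= threshold:
                clusters[k].append(v)
                break
        else:
            # No existing leader is close enough: v becomes a new leader.
            leaders.append(v)
            clusters.append([v])
    return clusters


# Same four records, two different presentation orders.
print(leader_clustering([1, 2, 3, 4], threshold=1.5))  # [[1, 2], [3, 4]]
print(leader_clustering([2, 1, 3, 4], threshold=1.5))  # [[2, 1, 3], [4]]
```

The two runs group the same data differently only because the first record seen becomes the first leader.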

High dimensionality − A database or a data warehouse can contain many dimensions or attributes. Some clustering algorithms are good at handling low-dimensional data involving only two or three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.
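One commonly cited reason high-dimensional clustering is hard is that distances tend to become less discriminating as dimensions are added. The sketch below, using uniformly random points as an assumed stand-in for real data, compares how much the farthest point differs from the nearest one in low and high dimensions. The helper name `relative_contrast` is hypothetical.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest distance from a random query point
    to n_points uniformly random points in the unit hypercube."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = [rng.random() for _ in range(n_dims)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)


for dims in (2, 10, 100, 500):
    print(dims, relative_contrast(200, dims))
```

On random data like this, the contrast typically shrinks as the number of dimensions grows, which makes "near" and "far" harder to tell apart for distance-based clustering.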

Published on 24-Nov-2021 06:55:16