What is Semi-Supervised Cluster Analysis?

Semi-supervised clustering is a method that partitions unlabeled data by making use of domain knowledge. This knowledge is generally expressed as pairwise constraints between instances or as an additional set of labeled instances.

The quality of unsupervised clustering can be significantly improved using some weak form of supervision, for instance, pairwise constraints (i.e., pairs of objects labeled as belonging to the same cluster or to different clusters). A clustering procedure that depends on such user feedback or guidance constraints is known as semi-supervised clustering.
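Pairwise constraints are commonly called must-link (same cluster) and cannot-link (different clusters) constraints. The following is a minimal sketch of how such constraints could be represented and checked against a cluster assignment. The function name `count_violations` and the toy data are assumptions made for illustration, not part of any particular library.

```python
def count_violations(labels, must_link, cannot_link):
    """Count the pairwise constraints that a cluster assignment breaks."""
    violated = 0
    for i, j in must_link:
        # A must-link pair is violated when the two objects are separated.
        if labels[i] != labels[j]:
            violated += 1
    for i, j in cannot_link:
        # A cannot-link pair is violated when the two objects share a cluster.
        if labels[i] == labels[j]:
            violated += 1
    return violated


# Hypothetical assignment of five objects to two clusters.
labels = [0, 0, 1, 1, 1]
must_link = [(0, 1), (2, 3)]
cannot_link = [(0, 2), (3, 4)]

# Only the cannot-link pair (3, 4) is broken, so this prints 1.
print(count_violations(labels, must_link, cannot_link))
```

A count like this can serve as a penalty term or as a simple check on how well a partitioning respects the user's guidance.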

Methods for semi-supervised clustering can be divided into two classes, which are as follows −

Constraint-based semi-supervised clustering − It relies on user-provided labels or constraints to guide the algorithm toward a more appropriate data partitioning. This includes modifying the objective function based on the constraints, or initializing and constraining the clustering process based on the labeled objects.
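As an illustration of initializing and constraining the clustering process from labeled objects, here is a minimal sketch of a seeded variant of k-means. It assumes two-dimensional points and at least one labeled object per cluster; the names `seeded_kmeans`, `seeds`, and the toy data are hypothetical.

```python
def mean(pts):
    """Component-wise mean of a non-empty list of 2-D points."""
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)


def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def seeded_kmeans(points, seeds, n_iter=10):
    """k-means whose centroids start from labeled objects.

    seeds maps a point index to its known cluster id (assumption:
    every cluster id 0..k-1 has at least one seed).
    """
    k = max(seeds.values()) + 1
    # Initialization: each centroid is the mean of its labeled objects.
    centroids = [
        mean([points[i] for i, lab in seeds.items() if lab == c])
        for c in range(k)
    ]
    labels = []
    for _ in range(n_iter):
        # Constraint: labeled objects always keep their given cluster;
        # unlabeled objects go to the nearest centroid.
        labels = [
            seeds[i] if i in seeds
            else min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for i, p in enumerate(points)
        ]
        # Update step: recompute each centroid from its current members.
        centroids = [
            mean([p for p, lab in zip(points, labels) if lab == c])
            for c in range(k)
        ]
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
seeds = {0: 0, 3: 1}  # object 0 is known to be in cluster 0, object 3 in cluster 1
labels, centroids = seeded_kmeans(points, seeds)
print(labels)
```

Because the seeded objects never change cluster, no cluster can become empty, which keeps the update step well defined in this sketch.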

Distance-based semi-supervised clustering − It employs an adaptive distance measure that is trained to satisfy the labels or constraints in the supervised data. Several adaptive distance measures have been used, including string-edit distance trained using Expectation-Maximization (EM) and Euclidean distance modified by a shortest-distance algorithm.
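The idea of adapting a distance measure to the supervision can be sketched with a much simpler scheme than the measures named above. The example below is an illustrative heuristic only (it is not EM-trained string-edit distance or the shortest-distance modification): it learns one weight per feature from must-link pairs, down-weighting features along which objects known to belong together differ a lot. The names `learn_weights` and `weighted_sq_dist` are assumptions.

```python
def learn_weights(points, must_link, eps=1e-9):
    """One weight per feature: inverse mean squared difference over must-link pairs."""
    dims = len(points[0])
    weights = []
    for d in range(dims):
        msd = sum((points[i][d] - points[j][d]) ** 2 for i, j in must_link)
        msd /= len(must_link)
        # Features that vary widely inside a cluster get a small weight.
        weights.append(1.0 / (msd + eps))
    return weights


def weighted_sq_dist(a, b, weights):
    """Squared Euclidean distance with per-feature weights."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))


# Toy data: feature 0 separates the groups, feature 1 is mostly noise.
points = [(0.0, 0.0), (0.1, 5.0), (3.0, 0.2), (3.1, 4.8)]
must_link = [(0, 1), (2, 3)]

weights = learn_weights(points, must_link)
plain = [1.0, 1.0]

# With plain Euclidean distance, object 0 looks closer to object 2 than to
# its must-link partner 1; with the learned weights the order is reversed.
print(weighted_sq_dist(points[0], points[1], plain) > weighted_sq_dist(points[0], points[2], plain))
print(weighted_sq_dist(points[0], points[1], weights) < weighted_sq_dist(points[0], points[2], weights))
```

Any standard clustering algorithm could then be run with the learned distance in place of the plain Euclidean one.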

An interesting clustering method known as CLTree (CLustering based on decision TREEs) integrates unsupervised clustering with the concept of supervised classification. It is an example of constraint-based semi-supervised clustering. It transforms a clustering task into a classification task by treating the set of points to be clustered as belonging to one class, labeled “Y,” and adding a set of relatively uniformly distributed “nonexistent points” with a different class label, “N.”

The problem of partitioning the data space into data (dense) regions and empty (sparse) regions can then be transformed into a classification problem. The points to be clustered are treated as a set of “Y” points, and a collection of uniformly distributed “N” points, typically drawn as “o” markers in illustrations, is added to them.

The original clustering problem is thus transformed into a classification problem, which works out a scheme that distinguishes “Y” points from “N” points. A decision tree induction method can be used to partition the space, for example a two-dimensional one. In a two-dimensional example with two dense regions, two clusters are identified, and they consist of the “Y” points only.
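The relabeling idea can be sketched in a few lines. The example below is not the CLTree algorithm itself: it only labels the data as “Y”, adds a regular grid of “N” points (an assumption standing in for uniformly distributed points), and finds a single axis-aligned split by Gini impurity, which is the first step a decision tree learner would take. The function names and the toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 'Y'/'N' labels."""
    if not labels:
        return 0.0
    p = labels.count("Y") / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(points, labels):
    """Return (feature, threshold, impurity) of the best axis-aligned split."""
    best = None
    n = len(points)
    for f in range(len(points[0])):
        values = sorted({p[f] for p in points})
        # Candidate thresholds are midpoints between consecutive values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [lab for p, lab in zip(points, labels) if p[f] <= thr]
            right = [lab for p, lab in zip(points, labels) if p[f] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, thr, score)
    return best


# Data to be clustered: one dense region near the left edge of [0, 5] x [0, 1].
y_points = [(0.2, 0.2), (0.4, 0.8), (0.6, 0.4), (0.8, 0.6),
            (0.3, 0.5), (0.7, 0.3), (0.9, 0.7), (1.0, 0.5)]
# "Nonexistent" points spread evenly over the whole space.
n_points = [(x + 0.5, y) for x in range(5) for y in (0.25, 0.75)]

points = y_points + n_points
labels = ["Y"] * len(y_points) + ["N"] * len(n_points)

feature, threshold, impurity = best_split(points, labels)
# The split falls on feature 0 (x), between the dense and the empty region.
print(feature, threshold)
```

Applying such splits recursively yields regions dominated by “Y” points (candidate clusters) and regions dominated by “N” points (empty space).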

Inserting a large number of “N” points into the original data can introduce unnecessary computational overhead. Moreover, it is unlikely that the added points would be truly uniformly distributed in a very high-dimensional space, since this would require an exponential number of points.
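A rough way to see the exponential growth, assuming uniformity is approximated by a regular grid with a fixed number of positions per dimension (the helper `grid_size` is hypothetical):

```python
def grid_size(positions_per_dim, dims):
    """Number of points in a regular grid covering a dims-dimensional space."""
    return positions_per_dim ** dims


# With 10 positions per dimension, the count grows by a factor of 10
# for every added dimension.
for dims in (2, 3, 10):
    print(dims, grid_size(10, dims))
```

Even a coarse grid of 10 positions per dimension already needs 10 billion points in 10 dimensions, far more than most data sets contain.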