What are the methods for Clustering with Constraints?

Different techniques are needed to handle different kinds of constraints. The general principles for handling hard and soft constraints are as follows −

Handling Hard Constraints − A general strategy for handling hard constraints is to strictly respect the constraints in the cluster assignment process. Given a data set and a set of constraints on instances (i.e., must-link or cannot-link constraints), how can we extend the k-means method to satisfy such constraints? The COP-k-means algorithm works as follows −

Generate super instances for must-link constraints − Compute the transitive closure of the must-link constraints. Here, the must-link constraints are treated as an equivalence relation. The closure gives one or more subsets of objects, where all objects in a subset must be assigned to the same cluster.

To represent such a subset, replace all objects in the subset by their mean. This super instance also carries a weight, which is the number of objects it represents. After this step, the must-link constraints are always satisfied.
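
The super-instance step can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name build_super_instances, the use of a union-find structure to compute the transitive closure, and the representation of constraints as index pairs are all assumptions made for the example.

```python
import numpy as np

def build_super_instances(X, must_link):
    """Collapse each must-link group into one weighted mean point.

    X is an (n, d) array; must_link is a list of (i, j) index pairs.
    Both the name and the signature are hypothetical.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Merging groups pair by pair yields the transitive closure.
    for a, b in must_link:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    members = list(groups.values())
    # Each super instance is the mean of its group, weighted by group size.
    points = np.array([X[g].mean(axis=0) for g in members])
    weights = np.array([len(g) for g in members])
    return points, weights, members


X = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0], [10.0, 10.0]])
points, weights, members = build_super_instances(X, [(0, 1), (1, 2)])
print(members)  # [[0, 1, 2], [3]]
print(weights)  # [3 1]
```

Objects 0, 1 and 2 are linked transitively, so they collapse into a single super instance at their mean (2, 0) with weight 3, while object 3 stays on its own with weight 1.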

Conduct modified k-means clustering − In k-means, an object is assigned to the closest center. To respect cannot-link constraints, the center assignment process in k-means is changed to a closest feasible center assignment.

Objects are assigned to centers in sequence, and at every step the algorithm ensures that the assignments made so far do not violate any cannot-link constraint. Each object is assigned to the closest center for which the assignment respects all cannot-link constraints.

Because COP-k-means ensures that no constraint is violated at any step, it does not need any backtracking. It is a greedy algorithm for producing a clustering that satisfies all constraints, provided that no conflicts exist among the constraints.
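
The closest feasible center assignment can be sketched as follows. This covers only the assignment step of one iteration, under stated assumptions: the function name assign_closest_feasible is hypothetical, distances are Euclidean, objects are processed in index order, and the sketch raises an error when no feasible center remains for an object.

```python
import numpy as np

def assign_closest_feasible(points, centers, cannot_link):
    """Greedy assignment of each point to its nearest feasible center.

    cannot_link is a list of (i, j) index pairs. Name and signature
    are hypothetical, chosen for this illustration.
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)

    # Map each point to the points it must not share a cluster with.
    conflicts = {i: set() for i in range(len(points))}
    for a, b in cannot_link:
        conflicts[a].add(b)
        conflicts[b].add(a)

    labels = np.full(len(points), -1)  # -1 means "not assigned yet"
    for i, p in enumerate(points):
        dists = np.linalg.norm(centers - p, axis=1)
        for c in np.argsort(dists):  # try the nearest center first
            # Feasible if no already-assigned conflicting point uses c.
            if all(labels[j] != c for j in conflicts[i]):
                labels[i] = c
                break
        else:
            raise ValueError(f"no feasible center for point {i}")
    return labels


points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
centers = [[0.0, 0.0], [5.0, 5.0]]
labels = assign_closest_feasible(points, centers, [(0, 1)])
print(labels)  # [0 1 1]
```

Point 1 is nearest to center 0, but that center already holds point 0, which it cannot be linked with, so it falls through to the next closest center. No earlier assignment is ever revisited, which is the greedy, no-backtracking behavior described above.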

Handling Soft Constraints − Clustering with soft constraints is an optimization problem. When a clustering violates a soft constraint, a penalty is imposed on the clustering. Hence, the optimization goal has two parts: optimizing the clustering quality and minimizing the constraint violation penalty. The overall objective function is a combination of the clustering quality score and the penalty score.

Given a data set and a set of soft constraints on instances, the CVQE (Constrained Vector Quantization Error) algorithm performs k-means clustering while enforcing constraint violation penalties. The objective function used in CVQE is the sum of the distances used in k-means, adjusted by the constraint violation penalties, which are computed as follows −

Penalty of a must-link violation − If there is a must-link constraint on objects x and y, but they are assigned to two different centers, c1 and c2, respectively, then the constraint is violated. As a result, dist(c1, c2), the distance between c1 and c2, is added to the objective function as the penalty.

Penalty of a cannot-link violation − If there is a cannot-link constraint on objects x and y, but they are assigned to a common center, c, then the constraint is violated. The distance dist(c, c′) between c and c′ is added to the objective function as the penalty, where c′ is the closest other cluster center to c that can accommodate x or y.
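
The penalized objective can be sketched as follows. This is an illustration under stated assumptions rather than the exact CVQE formulation: the function name cvqe_objective is hypothetical, squared Euclidean distance is assumed for both the k-means term and the penalties, and for a cannot-link violation the penalty is assumed to be the distance from the shared center to its nearest other center.

```python
import numpy as np

def cvqe_objective(X, labels, centers, must_link, cannot_link):
    """k-means distortion plus soft-constraint violation penalties.

    Name, signature and the squared-distance choice are assumptions.
    """
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)

    def sqdist(u, v):
        return float(np.sum((u - v) ** 2))

    # Standard k-means term: each object to its assigned center.
    total = sum(sqdist(x, centers[l]) for x, l in zip(X, labels))

    # Must-link violated: the two objects sit at different centers.
    for a, b in must_link:
        if labels[a] != labels[b]:
            total += sqdist(centers[labels[a]], centers[labels[b]])

    # Cannot-link violated: the two objects share a center c.
    # Penalize by the distance from c to its nearest other center.
    for a, b in cannot_link:
        if labels[a] == labels[b]:
            c = labels[a]
            others = [sqdist(centers[c], centers[k])
                      for k in range(len(centers)) if k != c]
            if others:  # no penalty can be formed with a single center
                total += min(others)
    return total


X = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]]
centers = [[0.5, 0.0], [4.0, 0.0]]
labels = [0, 0, 1]
print(cvqe_objective(X, labels, centers, [], []))              # 0.5
print(cvqe_objective(X, labels, centers, [(1, 2)], [(0, 1)]))  # 25.0
```

In the second call both constraints are violated, and each adds the squared distance between the two centers (12.25) to the unconstrained distortion of 0.5. An optimizer that minimizes this score is therefore pushed toward clusterings that violate fewer constraints, without being forbidden from violating any.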