Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints. Depending on the nature of the constraints, constraint-based clustering can adopt rather different approaches. The main categories of constraints are as follows −
Constraints on individual objects − Constraints can be placed on the objects to be clustered. In a real estate application, for instance, one may want to spatially cluster only those luxury mansions worth over a million dollars. This constraint restricts the set of objects to be clustered. It can be handled easily by preprocessing (e.g., performing a selection with an SQL query), after which the problem reduces to an instance of unconstrained clustering.
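The preprocessing step described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `properties` table with made-up columns (`kind`, `price`, `x`, `y`) and sample rows; none of these names or values come from a real dataset. The selected rows would then be passed to any ordinary clustering algorithm.

```python
import sqlite3

# Hypothetical in-memory table of properties (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE properties (id INTEGER, kind TEXT, price REAL, x REAL, y REAL)"
)
conn.executemany(
    "INSERT INTO properties VALUES (?, ?, ?, ?, ?)",
    [
        (1, "mansion", 2_500_000, 1.0, 1.2),
        (2, "mansion", 800_000, 1.1, 0.9),
        (3, "apartment", 1_200_000, 5.0, 5.1),
        (4, "mansion", 1_750_000, 5.2, 4.8),
    ],
)


def select_objects(connection, min_price=1_000_000):
    """Return (id, x, y) for mansions priced above min_price.

    The object-level constraint is enforced here, before clustering,
    so the clustering step itself stays unconstrained.
    """
    rows = connection.execute(
        "SELECT id, x, y FROM properties "
        "WHERE kind = 'mansion' AND price > ? ORDER BY id",
        (min_price,),
    )
    return rows.fetchall()


selected = select_objects(conn)
print(selected)  # only properties 1 and 4 pass the filter
```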
Constraints on the selection of clustering parameters − A user may want to set a desired range for each clustering parameter. Clustering parameters are generally quite specific to the given clustering algorithm. Examples include k, the desired number of clusters in the k-means algorithm, or ε (the radius) and MinPts (the minimum number of points) in the DBSCAN algorithm.
Although such user-specified parameters can strongly influence the clustering results, they are generally confined to the algorithm itself. Therefore, their fine-tuning and processing are generally not treated as a form of constraint-based clustering.
Constraints on distance or similarity functions − A user can specify different distance or similarity functions for specific attributes of the objects to be clustered, or different distance measures for specific pairs of objects. When clustering athletes, for instance, one can use different weighting schemes for height, body weight, age, and skill level.
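One simple way to express such a weighting scheme is a weighted Euclidean distance. The sketch below is only an illustration: the attribute names and weight values are assumptions chosen for the example, and in practice the attributes would normally be scaled to comparable ranges first.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance over the attributes named in `weights`.

    `a` and `b` are dicts mapping attribute name to a numeric value;
    `weights` maps attribute name to a non-negative weight.
    """
    return math.sqrt(sum(w * (a[k] - b[k]) ** 2 for k, w in weights.items()))


# Hypothetical athletes and weights (illustrative values only).
athlete_1 = {"height": 180.0, "weight": 75.0, "age": 24.0, "skill": 7.0}
athlete_2 = {"height": 183.0, "weight": 75.0, "age": 24.0, "skill": 3.0}

equal_weights = {"height": 1.0, "weight": 1.0, "age": 1.0, "skill": 1.0}
skill_only = {"height": 0.0, "weight": 0.0, "age": 0.0, "skill": 1.0}

print(weighted_distance(athlete_1, athlete_2, equal_weights))  # 5.0
print(weighted_distance(athlete_1, athlete_2, skill_only))     # 4.0
```

Changing the weights changes which athletes count as "close", and therefore which clusters a distance-based algorithm will form.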
User-specified constraints on the properties of individual clusters − A user may want to specify desired characteristics of the resulting clusters, which can strongly influence the clustering process.
Consider a package delivery company that would like to determine the locations for k service stations in a city. The company has a customer database that records the customers’ names, locations, length of time since they began using the company’s services, and average monthly charge. This location selection problem can be formulated as an instance of unconstrained clustering using a distance function computed from customer location.
A smarter approach is to partition the customers into two classes − high-value customers (who require frequent, regular service) and ordinary customers (who require occasional service). To save costs and provide good service, the manager adds the following constraints −
Each station must serve a minimum of 100 high-value customers.
Each station must serve a minimum of 5,000 ordinary customers.
Constraint-based clustering will consider such constraints during the clustering procedure.
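As a rough illustration of the two minimum-load constraints above, the following sketch checks a candidate assignment of customers to stations and reports which stations fall short. The function name, the `"high"`/`"ordinary"` labels, and the data layout are assumptions made for this example; a real constraint-based algorithm would use such a check (or something equivalent) while building the clusters, not only afterwards.

```python
from collections import Counter


def violated_stations(assignment, customer_class, k,
                      min_high=100, min_ordinary=5000):
    """Return the stations (0..k-1) that fail either minimum-load constraint.

    assignment     : dict mapping customer id -> station index
    customer_class : dict mapping customer id -> "high" or "ordinary"
    """
    counts = {station: Counter() for station in range(k)}
    for customer, station in assignment.items():
        counts[station][customer_class[customer]] += 1
    return [
        station
        for station in range(k)
        if counts[station]["high"] < min_high
        or counts[station]["ordinary"] < min_ordinary
    ]


# Tiny hypothetical example with lowered thresholds for readability.
assignment = {"c1": 0, "c2": 0, "c3": 1}
classes = {"c1": "high", "c2": "ordinary", "c3": "ordinary"}
print(violated_stations(assignment, classes, k=2, min_high=1, min_ordinary=1))
# [1]  (station 1 has no high-value customer)
```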
Semi-supervised clustering based on “partial” supervision − The quality of unsupervised clustering can be significantly improved using some weak form of supervision. This can take the form of pairwise constraints (i.e., pairs of objects labeled as belonging to the same cluster or to different clusters). Such a constrained clustering process is known as semi-supervised clustering.
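Pairwise constraints are often called must-link (same cluster) and cannot-link (different clusters) constraints. The sketch below shows one way they might be checked when deciding whether an object may join a cluster, in the spirit of the constraint check used by COP-KMeans-style algorithms. The function and variable names are hypothetical, and this covers only the check, not a full clustering algorithm.

```python
def violates_constraints(obj, cluster, assignment, must_link, cannot_link):
    """Return True if placing `obj` in `cluster` breaks a pairwise constraint.

    assignment  : dict mapping already-assigned object -> cluster index
    must_link   : iterable of (a, b) pairs that must share a cluster
    cannot_link : iterable of (a, b) pairs that must not share a cluster
    """
    for a, b in must_link:
        if obj in (a, b):
            other = b if a == obj else a
            # The partner is already placed in a different cluster.
            if other in assignment and assignment[other] != cluster:
                return True
    for a, b in cannot_link:
        if obj in (a, b):
            other = b if a == obj else a
            # The partner is already placed in this same cluster.
            if other in assignment and assignment[other] == cluster:
                return True
    return False


# Hypothetical objects: "p" is already in cluster 0.
assignment = {"p": 0}
must_link = [("p", "q")]
cannot_link = [("p", "r")]

print(violates_constraints("q", 1, assignment, must_link, cannot_link))  # True
print(violates_constraints("q", 0, assignment, must_link, cannot_link))  # False
print(violates_constraints("r", 0, assignment, must_link, cannot_link))  # True
```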