Multi-relational clustering is the process of partitioning data objects into a set of clusters based on their similarity, utilizing information in multiple relations. This section introduces CrossClus (Cross-relational Clustering with user guidance), an algorithm for multi-relational clustering that explores how to utilize user guidance in clustering and uses tuple ID propagation to avoid physical joins.
One major challenge in multi-relational clustering is that there are too many attributes in different relations, and usually only a small portion of them are relevant to a specific clustering task.
Consider a computer science department database. For the task of clustering students, the attributes cover many different aspects of information, such as courses taken by students, publications of students, advisors and research groups of students, and so on.
A user is usually interested in clustering students using a certain aspect of information (e.g., clustering students by their research areas). Users often have a good grasp of their application’s requirements and data semantics. Therefore, a user’s guidance, even in the form of a simple query, can be used to improve the efficiency and quality of high-dimensional multi-relational clustering.
CrossClus accepts user queries that contain a target relation and one or more pertinent attributes, which together specify the clustering goal of the user. To work with attributes that lie outside the target relation, CrossClus defines multi-relational attributes. A multi-relational attribute A’ is defined by a join path Rt ⋈ R1 ⋈ … ⋈ Rk, an attribute Rk.A of Rk, and possibly an aggregation operator (e.g., average, count, max).
A’ is formally represented by [A’.joinpath, A’.attr, A’.aggr], in which A’.aggr is optional. A multi-relational attribute A’ is either a categorical feature or a numerical one, depending on whether Rk.A is categorical or numerical. If A’ is a categorical feature, then for a target tuple t, t.A’ represents the distribution of values among tuples in Rk that are joinable with t.
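The definition above can be made concrete with a small sketch. It is illustrative only and is not CrossClus's implementation: the relation names (Student, Registration, Course), the attribute name area, the toy tuples, and the helper names MultiRelationalAttribute and categorical_value are all assumptions introduced for this example. It computes t.A’ for a categorical A’ as the share of each value among the Course tuples joinable with a given student.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MultiRelationalAttribute:
    """Hypothetical record mirroring [A'.joinpath, A'.attr, A'.aggr]."""
    join_path: Tuple[str, ...]   # relations, starting at the target relation Rt
    attr: str                    # attribute Rk.A of the last relation Rk
    aggr: Optional[str] = None   # optional aggregation operator


# A' = [Student ⋈ Registration ⋈ Course, area]; no aggregation operator.
course_area_attr = MultiRelationalAttribute(
    join_path=("Student", "Registration", "Course"), attr="area"
)

# Toy data (assumed): Registration links students to courses.
registration = [
    ("s1", "c1"), ("s1", "c2"), ("s1", "c3"),
    ("s2", "c3"), ("s2", "c4"),
]
course_area = {"c1": "DB", "c2": "DB", "c3": "AI", "c4": "DB"}


def categorical_value(student_id):
    """Return t.A' as a distribution over Course.area values
    among the Course tuples joinable with the given student."""
    areas = [course_area[c] for s, c in registration if s == student_id]
    counts = Counter(areas)
    total = sum(counts.values())
    if total == 0:
        return {}  # the student joins with no Course tuple
    return {value: n / total for value, n in counts.items()}


print(categorical_value("s1"))
print(categorical_value("s2"))
```

Under these assumptions, student s1 is described by a distribution weighted two to one toward DB, while s2 is split evenly between AI and DB. Two students can then be compared by comparing their distributions rather than single attribute values.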
In the multi-relational clustering process, CrossClus needs to search for pertinent attributes across multiple relations, and it must address two major challenges in doing so. First, the target relation, Rt, can usually join with each nontarget relation, R, via many different join paths, and each attribute in R can be used as a multi-relational attribute.
An exhaustive search of this huge search space is therefore infeasible. Second, among the huge number of attributes, some are pertinent to the user query (e.g., a student’s advisor is related to her research area), whereas many others are irrelevant (e.g., the personal information of a student’s classmates).
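To give a feel for the first challenge, the sketch below enumerates join paths in a small schema graph. It is a simplified illustration, not part of CrossClus: the schema (its relation names and the joins between them) and the function join_paths are hypothetical, and each path is restricted to visit a relation at most once, a restriction that real join paths need not obey.

```python
# Hypothetical schema graph: an edge means two relations can be joined.
schema = {
    "Student": ["Registration", "Advise", "Publish"],
    "Registration": ["Student", "Course"],
    "Course": ["Registration", "Professor"],
    "Advise": ["Student", "Professor"],
    "Professor": ["Advise", "Publish", "Course"],
    "Publish": ["Student", "Professor", "Publication"],
    "Publication": ["Publish"],
}


def join_paths(schema, start, max_joins):
    """Yield every join path from `start` that uses between 1 and
    `max_joins` joins and visits each relation at most once."""
    stack = [(start,)]
    while stack:
        path = stack.pop()
        if len(path) > 1:
            yield path
        if len(path) - 1 < max_joins:
            for nxt in schema[path[-1]]:
                if nxt not in path:
                    stack.append(path + (nxt,))


# Join paths that connect the target relation to Professor.
to_professor = [p for p in join_paths(schema, "Student", 4) if p[-1] == "Professor"]
for path in sorted(to_professor, key=len):
    print(" ⋈ ".join(path))
```

Even in this seven-relation toy schema, Student reaches Professor through advising, co-publication, and course registration. Every attribute of Professor yields a distinct multi-relational attribute along each of those paths, which is why the number of candidates grows so quickly in a realistic schema.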