Graph-Based Clustering



Graph clustering is used to partition a graph into meaningful subgroups, ensuring that nodes within the same cluster are highly connected, while nodes in different clusters have fewer connections.

The goal is to detect natural divisions or communities within the graph, revealing hidden patterns and relationships.

In this tutorial, we will explore the fundamental concepts, algorithms, and real-world applications of graph-based clustering.

Why Use Graph-Based Clustering?

Graph-based clustering is helpful when data is naturally connected. Some main benefits are −

  • Captures Structural Relationships: Unlike traditional clustering, which groups points by feature similarity alone, graph clustering uses the connections (edges) between nodes.
  • Flexibility: It can be applied to weighted, directed, and dynamic graphs.
  • Handles Large Networks: Efficient algorithms exist for large-scale networks.
  • Easy to Interpret: The clusters often correspond to meaningful real-world communities.

Types of Graph Clustering

Graph-based clustering methods can be divided into the following types −

  • Community Detection: Groups nodes that are strongly connected.
  • Spectral Clustering: Uses the eigenvalues and eigenvectors of the graph Laplacian matrix to identify clusters.
  • Density-Based Clustering: Finds clusters based on how densely nodes are connected in the graph.
  • Hierarchical Clustering: Constructing a hierarchy of clusters.

Common Graph Clustering Algorithms

Commonly used algorithms for grouping the nodes of a graph include −

  • Girvan-Newman Algorithm
  • Spectral Clustering
  • Louvain Algorithm
  • Markov Clustering (MCL)

Girvan-Newman Algorithm

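The Girvan-Newman algorithm finds communities by iteratively removing the edge with the highest betweenness centrality, that is, the edge that lies on the largest number of shortest paths between pairs of nodes. Because such edges tend to bridge different groups, removing them causes the graph to split into smaller groups, i.e. clusters.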

Steps for the Girvan-Newman Algorithm:

  • Compute the betweenness centrality for all edges.
  • Remove the edge with the highest betweenness centrality.
  • Recompute the betweenness centrality of the remaining edges and repeat until the graph is split into the desired number of clusters.

Example

The following example demonstrates how to run the Girvan-Newman algorithm in Python using the NetworkX library. It loads a sample graph, the Karate Club graph, and applies the girvan_newman() function to detect communities −

import networkx as nx
from networkx.algorithms.community import girvan_newman

# Load a sample graph
G = nx.karate_club_graph()  
comp = girvan_newman(G)
top_level_communities = next(comp)
print(top_level_communities)

The girvan_newman() function returns a generator, so next() yields the first split, which divides the graph into two top-level communities −

({0, 1, 3, 4, 5, 6, 7, 10, 11, 12, 13, 16, 17, 19, 21}, {2, 8, 9, 14, 15, 18, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33})
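
The steps listed above can also be written out by hand. The following is a minimal sketch, not NetworkX's own implementation: the function name girvan_newman_split and its target parameter are hypothetical names chosen for illustration, and edge betweenness is computed on the unweighted graph.

```python
import networkx as nx

def girvan_newman_split(G, target=2):
    """Remove high-betweenness edges until the graph has `target` components."""
    H = G.copy()  # work on a copy so the original graph is left untouched
    while nx.number_connected_components(H) < target and H.number_of_edges() > 0:
        # Step 1: compute betweenness centrality for every remaining edge
        betweenness = nx.edge_betweenness_centrality(H)
        # Step 2: remove the edge with the highest betweenness
        u, v = max(betweenness, key=betweenness.get)
        H.remove_edge(u, v)
        # Step 3: loop, which recomputes betweenness on the reduced graph
    return [set(c) for c in nx.connected_components(H)]

G = nx.karate_club_graph()
communities = girvan_newman_split(G, target=2)
for members in communities:
    print(sorted(members))
```

Raising target continues the removal process and produces a finer partition, at the cost of recomputing betweenness after every removed edge.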

Spectral Clustering

Spectral clustering is a technique that uses the eigenvectors of the graph Laplacian matrix to find clusters in a graph. It works by first transforming the graph's adjacency matrix into a Laplacian matrix (the degree matrix minus the adjacency matrix), which captures the structure of the graph.

Then, the eigenvectors associated with the smallest eigenvalues of this matrix are used to project the nodes into a lower-dimensional space.

Finally, clustering techniques like k-means are applied to the transformed data to group the nodes into clusters. This method is effective for finding clusters in non-convex shapes or graphs that are difficult to separate using traditional clustering algorithms.

Steps for Spectral Clustering:

  • Compute the graph Laplacian.
  • Extract the k eigenvectors that correspond to the k smallest eigenvalues.
  • Apply k-means clustering to the rows of the resulting eigenvector matrix.

Example

This example demonstrates how to implement spectral clustering using Python. It computes the Laplacian matrix of the graph, performs spectral embedding to reduce the graph's dimensions, and then applies k-means clustering to group the nodes into two clusters −

import numpy as np
import networkx as nx
from sklearn.cluster import KMeans
from scipy.sparse.linalg import eigsh

# Define a graph (e.g., using the Karate Club graph)
G = nx.karate_club_graph()

# Remove isolated nodes
G.remove_nodes_from(list(nx.isolates(G)))

# Ensure the graph is connected
if not nx.is_connected(G):
   print("Warning: The graph is disconnected.")
   components = list(nx.connected_components(G))
   print(f"Found {len(components)} connected components.")
   # Select the largest connected component for further analysis
   largest_component = max(components, key=len)
   G = G.subgraph(largest_component)
else:
   print("The graph is connected.")

# Compute the Laplacian matrix and convert it to float for numerical stability
L = nx.laplacian_matrix(G).toarray().astype(np.float64)

# Check for NaN or infinite values in the Laplacian matrix
print("Laplacian Matrix:")
print(L)
if np.any(np.isnan(L)) or np.any(np.isinf(L)):
   print("The Laplacian matrix contains NaN or infinite values.")
else:
   print("The Laplacian matrix is clean.")
    
   # Add a small epsilon to diagonal for numerical stability
   epsilon = 1e-6
   L += np.eye(L.shape[0]) * epsilon

   # Perform spectral clustering using eigenvalue decomposition (Laplacian eigenmap)
   try:
      # We will calculate the first 'k' eigenvectors for the embedding
      k = 2  # Number of clusters (2 in this case)
      eigenvalues, eigenvectors = eigsh(L, k=k, which='SM')

      # Normalize the eigenvectors row-wise to form the embedding
      embedding = eigenvectors / np.linalg.norm(eigenvectors, axis=1)[:, None]

   except ValueError as e:
      print(f"Error in eigenvalue computation: {e}")
   else:
      # Apply k-means clustering to the embedded data
      kmeans = KMeans(n_clusters=k, n_init=10).fit(embedding)
      labels = kmeans.labels_

      # Print the cluster labels for each node
      print("Cluster labels:")
      print(labels)

The output shows the Laplacian matrix followed by the cluster label assigned to each node. Because k-means starts from a random initialization, the two label values may be swapped between runs −

The graph is connected.
Laplacian Matrix:
[[42. -4. -5. ... -2.  0.  0.]
 [-4. 29. -6. ...  0.  0.  0.]
 [-5. -6. 33. ...  0. -2.  0.]
 ...
 [-2.  0.  0. ... 21. -4. -4.]
 [ 0.  0. -2. ... -4. 38. -5.]
 [ 0.  0.  0. ... -4. -5. 48.]]
The Laplacian matrix is clean.

Cluster labels:
[0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
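
For a two-way split, k-means is not strictly required. The sketch below assumes the unweighted Laplacian (weight=None) and a dense eigendecomposition with np.linalg.eigh, both choices made for this illustration. It divides the nodes by the sign of the eigenvector that belongs to the second-smallest eigenvalue, often called the Fiedler vector.

```python
import networkx as nx
import numpy as np

G = nx.karate_club_graph()
nodes = list(G.nodes())

# Unweighted Laplacian L = D - A (weight=None is an assumption of this sketch)
L = nx.laplacian_matrix(G, nodelist=nodes, weight=None).toarray().astype(float)

# eigh returns eigenvalues in ascending order with matching eigenvector columns
eigenvalues, eigenvectors = np.linalg.eigh(L)

# Column 0 is the constant eigenvector (eigenvalue 0); column 1 is the Fiedler vector
fiedler = eigenvectors[:, 1]

# Split the nodes by the sign of their Fiedler vector entry
labels = (fiedler > 0).astype(int)
for cluster in (0, 1):
    members = [n for n, lab in zip(nodes, labels) if lab == cluster]
    print(f"Cluster {cluster}: {members}")
```

The sign of an eigenvector is arbitrary, so which group is called cluster 0 can differ between environments; only the grouping itself is meaningful.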

Louvain Algorithm

The Louvain algorithm is a community detection method that focuses on optimizing a measure called modularity. Modularity is a value that quantifies the strength of divisions in a network, specifically how well nodes are grouped into communities.
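
For reference, the standard definition of modularity for an undirected graph can be written as follows. The symbols are introduced here for illustration: A_ij is the weight of the edge between nodes i and j (0 if there is none), k_i is the degree of node i, m is the total edge weight, c_i is the community of node i, and delta equals 1 when its two arguments are equal and 0 otherwise.

```latex
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
```

The term k_i k_j / 2m is the expected edge weight between i and j if edges were placed at random while preserving node degrees, so Q is large when communities contain more internal edges than chance would predict.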

The Louvain algorithm works in two repeating phases. It first moves individual nodes into the neighboring community that gives the largest increase in modularity, and then merges each community into a single node to build a smaller graph. This approach is highly efficient and can be applied to large graphs, making it popular for detecting communities in social, biological, and transportation networks.

Steps for Louvain Algorithm:

  • Assign each node to its own community.
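  • Move each node to the neighboring community that yields the largest increase in modularity.
  • Collapse each community into a single node and repeat until no further improvement in modularity is possible.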

Example

This code uses the Louvain implementation from the python-louvain package, which is imported as community, to detect communities in a graph G. The best_partition() function returns a dictionary that maps each node to a community ID; the code prints this partition and then uses it to color the nodes in a plot −

import networkx as nx
import community as community_louvain
import matplotlib.pyplot as plt

# Create a graph (using Karate Club graph as an example)
G = nx.karate_club_graph()

# Apply Louvain algorithm to detect communities
partition = community_louvain.best_partition(G)

# Print the community each node belongs to
print(partition)

# Visualize the graph with the communities
pos = nx.spring_layout(G)
plt.figure(figsize=(8, 8))

# Draw the graph with node colors corresponding to their communities
nx.draw_networkx_nodes(G, pos, nodelist=list(partition.keys()), node_size=700, cmap=plt.cm.jet, node_color=list(partition.values()))
nx.draw_networkx_edges(G, pos, alpha=0.5)
nx.draw_networkx_labels(G, pos, font_size=10)

plt.title("Louvain Community Detection")
plt.show()

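The following output is obtained along with the graph. The algorithm is randomized, so the community numbering and exact assignments may differ between runs −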

{0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 0, 8: 3, 9: 3, 10: 1, 11: 0, 12: 0, 13: 0, 14: 3, 15: 3, 16: 1, 17: 0, 18: 3, 19: 0, 20: 3, 21: 0, 22: 3, 23: 2, 24: 2, 25: 2, 26: 3, 27: 2, 28: 2, 29: 3, 30: 3, 31: 2, 32: 3, 33: 3}
[Figure: Karate Club graph with nodes colored by Louvain community]
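
If the python-louvain package is not available, recent NetworkX releases ship their own Louvain implementation. The sketch below assumes a NetworkX version that provides louvain_communities(); the seed value is arbitrary and only serves to make the run repeatable.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

G = nx.karate_club_graph()

# louvain_communities returns a list of node sets, one per community
communities = louvain_communities(G, seed=42)

for i, members in enumerate(communities):
    print(f"Community {i}: {sorted(members)}")

# Modularity of the detected partition (edge weights are used by default)
score = modularity(G, communities)
print(f"Modularity: {score:.3f}")
```

Unlike best_partition(), which returns a node-to-community dictionary, this function returns the communities themselves as sets of nodes.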

Markov Clustering (MCL)

MCL (Markov Clustering) is an algorithm that simulates random walks on a graph to identify densely connected clusters. It works by alternating two matrix operations, expansion and inflation.

The algorithm starts from a column-stochastic transition matrix built from the graph's adjacency matrix, usually with self-loops added. It then alternates expansion (raising the matrix to a power, which simulates longer random walks) and inflation (raising each entry to a power and rescaling each column to sum to 1), which emphasizes strong connections while weakening weaker ones.

The result is a partition of the graph into clusters where nodes within the same cluster are more strongly connected to each other compared to nodes in different clusters.

Steps for Markov Clustering:

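  • Expand: Raise the transition matrix to a power (typically 2) to simulate longer random walks.
  • Inflate: Raise each entry to a power greater than 1 and renormalize the columns, which strengthens intra-cluster connections.
  • Repeat until the matrix stops changing (convergence), then read the clusters from its nonzero rows.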
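
The steps above can be sketched with dense NumPy matrices. This is a minimal illustration rather than a reference implementation: the function name markov_clustering and its parameters (expansion, inflation, max_iter, tol) are hypothetical, edge weights are ignored, and a dense matrix is only practical for small graphs.

```python
import networkx as nx
import numpy as np

def markov_clustering(G, expansion=2, inflation=2.0, max_iter=100, tol=1e-6):
    nodes = list(G.nodes())
    # Unweighted adjacency matrix with self-loops added on the diagonal
    A = nx.to_numpy_array(G, nodelist=nodes, weight=None) + np.eye(len(nodes))
    # Normalize columns so that each column is a probability distribution
    M = A / A.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        previous = M
        # Expand: take the matrix power to simulate longer random walks
        M = np.linalg.matrix_power(M, expansion)
        # Inflate: element-wise power, then renormalize every column
        M = M ** inflation
        M = M / M.sum(axis=0, keepdims=True)
        if np.allclose(M, previous, atol=tol):
            break
    # Each row that still holds probability mass defines one cluster:
    # the columns with nonzero entries in that row are its members
    clusters = set()
    for row in M:
        members = frozenset(nodes[j] for j in np.flatnonzero(row > 1e-8))
        if members:
            clusters.add(members)
    return [set(c) for c in clusters]

G = nx.karate_club_graph()
clusters = markov_clustering(G)
print(f"Found {len(clusters)} clusters")
for members in clusters:
    print(sorted(members))
```

A larger inflation value generally produces more, smaller clusters, while a value closer to 1 produces fewer, larger ones.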

Applications of Graph-Based Clustering

Graph clustering is commonly used in various domains, such as −

  • Social Networks: Detecting user communities and recommending connections.
  • Biological Networks: Identifying functional modules in protein interaction networks.
  • Fraud Detection: Finding suspicious groups in financial transactions.
  • Document Clustering: Organizing text data into topic-based clusters.

Evaluating Graph Clustering Performance

To check how well clustering algorithms perform, we use different evaluation metrics:

  • Modularity: Shows how well the graph is divided into communities, with higher values indicating better division.
  • Normalized Mutual Information (NMI): Compares the similarity between the predicted clusters and the true clusters, with higher values meaning better matching.
  • Silhouette Score: Measures how similar a node is to its own cluster compared to other clusters, with higher scores indicating better clustering.
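
As an illustration of the first two metrics, the sketch below scores the first Girvan-Newman split of the Karate Club graph. It assumes that the "club" node attribute shipped with NetworkX's Karate Club graph is an acceptable ground truth, and it uses scikit-learn's normalized_mutual_info_score for NMI.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity
from sklearn.metrics import normalized_mutual_info_score

G = nx.karate_club_graph()

# Predicted clusters: the first split produced by Girvan-Newman
communities = next(girvan_newman(G))

# Modularity needs only the graph and the partition
q = modularity(G, communities)

# NMI compares predicted labels with the recorded club membership
node_to_cluster = {node: i for i, members in enumerate(communities) for node in members}
predicted = [node_to_cluster[n] for n in G.nodes()]
truth = [G.nodes[n]["club"] for n in G.nodes()]
nmi = normalized_mutual_info_score(truth, predicted)

print(f"Modularity: {q:.3f}")
print(f"NMI: {nmi:.3f}")
```

Modularity can be computed for any partition, whereas NMI is only available when true labels are known.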