Graph Theory - HITS Algorithm



HITS Algorithm

The Hyperlink-Induced Topic Search (HITS) algorithm, developed by Jon Kleinberg in 1999, is a link analysis algorithm that assigns two types of scores to web pages based on their hyperlink structure.

HITS is used primarily to find authorities and hubs within a graph, which has applications in web search, social network analysis, and recommendation systems.

HITS works by analyzing the link structure of a graph and classifying nodes (web pages, for example) into two categories −

  • Hubs: Pages that have many outgoing links to other important pages.
  • Authorities: Pages that are linked to by many hubs and thus are considered authoritative sources of information.

Why Use the HITS Algorithm?

The HITS algorithm is useful in graph-based applications where we want to identify the most important nodes in terms of both hub-like and authority-like qualities. The main advantages of HITS are −

  • Capturing Two Types of Importance: Unlike PageRank, which assigns each node a single importance score, HITS evaluates every node separately as a hub and as an authority.
  • Effective for Directed Graphs: HITS is well-suited for analyzing directed graphs, such as the web, where links between pages have a direction.
  • Applications in Ranking and Search: HITS can be used in search to identify pages that are not only popular but also authoritative on a particular topic.

Basic Concepts in the HITS Algorithm

Before understanding how HITS works, let us first define the main concepts −

  • Hub Score (h): Measures how good a node is at linking to other important nodes. Hubs are useful because they connect to many authorities.
  • Authority Score (a): Measures how valuable a node is as a source of information. Authorities are important because many hubs link to them.
  • Directed Graph: HITS works on directed graphs, where edges have a direction, like hyperlinks between web pages.

How the HITS Algorithm Works

The HITS algorithm works by performing two types of updates iteratively −

  • Hub Update: The hub score of a node is updated based on the authority scores of the nodes it points to. In other words, a hub's importance increases as it points to more authoritative nodes.
  • Authority Update: The authority score of a node is updated based on the hub scores of the nodes that point to it. A node is considered more authoritative if it is linked to by more valuable hubs.

The HITS algorithm alternates between these two steps until the scores converge. The mathematical formulation for the algorithm is as follows −

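  • Hub Update: $h_i = \sum_{j:\, i \to j} a_j$, where the sum runs over the nodes j that node i links to and $a_j$ is the authority score of node j.
  • Authority Update: $a_i = \sum_{j:\, j \to i} h_j$, where the sum runs over the nodes j that point to node i and $h_j$ is the hub score of node j.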

Both hub and authority scores are normalized after each update to ensure that they do not grow infinitely. The algorithm continues until the scores converge or a specified number of iterations is reached.
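
The same updates can be written compactly in matrix form. The following is a sketch under one assumption that the text above does not state: an adjacency matrix $A$ with $A_{ij} = 1$ when there is an edge from node i to node j and 0 otherwise, with $h$ and $a$ as column vectors of hub and authority scores.

```latex
\begin{aligned}
h &\leftarrow A\,a, \\
a &\leftarrow A^{\top} h, \\
h &\leftarrow \frac{h}{\lVert h \rVert_1}, \qquad
a \leftarrow \frac{a}{\lVert a \rVert_1}.
\end{aligned}
```

Substituting one update into the other gives $a \leftarrow A^{\top} A\, a$ and $h \leftarrow A A^{\top} h$ up to normalization. In this view the iteration is a power iteration, and when it converges the authority and hub scores are principal eigenvectors of $A^{\top} A$ and $A A^{\top}$ respectively.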

Steps to Execute HITS Algorithm

Let us go through a step-by-step execution of the HITS algorithm on a simple directed graph −

  • Step 1: Initialize hub and authority scores. Initially, all nodes are assigned an equal score. Typically, the initial scores are set to 1 for all nodes.
  • Step 2: Perform hub updates. For each node, update the hub score by summing the authority scores of the nodes it points to.
  • Step 3: Perform authority updates. For each node, update the authority score by summing the hub scores of the nodes that point to it.
  • Step 4: Normalize the scores. Normalize the hub and authority scores so that they sum to 1, ensuring the scores remain bounded.
  • Step 5: Repeat steps 2 to 4 until convergence. The algorithm iterates between hub and authority updates until the scores stabilize or the maximum number of iterations is reached.
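
The five steps above can be sketched in plain Python without any library. This is a minimal illustration, not a reference implementation: the function name `hits_scores`, its parameters, and the edge list are assumptions chosen for the example, and convergence is measured here as the total absolute change in scores between iterations.

```python
def hits_scores(edges, max_iter=100, tol=1e-8):
    """Return (hub, authority) dicts for a list of directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    in_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
        in_links[target].append(source)

    # Step 1: every node starts with hub and authority score 1.
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}

    for _ in range(max_iter):
        # Step 2: hub score = sum of authority scores of the nodes it points to.
        new_hub = {n: sum(auth[m] for m in out_links[n]) for n in nodes}
        # Step 3: authority score = sum of hub scores of the nodes pointing to it.
        new_auth = {n: sum(new_hub[m] for m in in_links[n]) for n in nodes}
        # Step 4: normalize so that each set of scores sums to 1.
        hub_total = sum(new_hub.values()) or 1.0
        auth_total = sum(new_auth.values()) or 1.0
        new_hub = {n: s / hub_total for n, s in new_hub.items()}
        new_auth = {n: s / auth_total for n, s in new_auth.items()}
        # Step 5: stop once the scores have stabilized.
        change = sum(
            abs(new_hub[n] - hub[n]) + abs(new_auth[n] - auth[n]) for n in nodes
        )
        hub, auth = new_hub, new_auth
        if change < tol:
            break
    return hub, auth


# Illustrative edge list (an assumption for this sketch).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "A")]
hub, auth = hits_scores(edges)
print("Hub scores:", hub)
print("Authority scores:", auth)
```

The authority update in this sketch uses the freshly computed hub scores from the same iteration, which matches the order of steps 2 and 3. Other orderings and normalizations are possible and may give differently scaled scores.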

Example: HITS Algorithm

Consider a simple directed graph with 4 nodes (A, B, C, D) and the following edges −

  • A → B
  • A → C
  • B → C
  • C → D
  • D → A

We will calculate the hub and authority scores for each node using the HITS algorithm.

First, initialize the hub and authority scores for all nodes to 1 −

hA = hB = hC = hD = 1
aA = aB = aC = aD = 1

Then, perform hub and authority updates iteratively, normalizing the scores after each iteration. This can be implemented in Python using the NetworkX library −

import networkx as nx

# Create a directed graph
G = nx.DiGraph()
G.add_edges_from([('A', 'B'), ('A', 'C'), ('B', 'C'), ('C', 'D'), ('D', 'A')])

# Run HITS algorithm
hubs, authorities = nx.hits(G, max_iter=100, tol=1.0e-8)

print("Hub scores:", hubs)
print("Authority scores:", authorities)

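Following is the output obtained. Values on the order of 1e-17 are floating-point rounding noise and are effectively zero, and the exact digits may vary between NetworkX versions −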

Hub scores: {'A': 0.6180339887498949, 'B': 0.38196601125010515, 'C': 7.061116596578133e-17, 'D': -9.970423965990023e-17}
Authority scores: {'A': -1.6132484859218386e-16, 'B': 0.3819660112501052, 'C': 0.6180339887498949, 'D': 1.1425126651789398e-16}
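
The nonzero scores in this output, about 0.618 and 0.382, match (√5 − 1)/2 and (3 − √5)/2. As a sanity check that does not depend on NetworkX, the following sketch applies one authority update and one hub update to those values, with the scores of C and D set to exactly 0, and confirms that the scores do not change. The helper `normalize` and the variable names are illustrative and not part of any library.

```python
import math

edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "A")]


def normalize(scores):
    """Scale the scores so that they sum to 1."""
    total = sum(scores.values())
    return {node: value / total for node, value in scores.items()}


golden = (math.sqrt(5) - 1) / 2  # about 0.618
hubs = {"A": golden, "B": 1 - golden, "C": 0.0, "D": 0.0}

# Authority update: each node collects the hub scores of the nodes pointing to it.
authorities = {node: 0.0 for node in hubs}
for source, target in edges:
    authorities[target] += hubs[source]
authorities = normalize(authorities)

# Hub update: each node collects the authority scores of the nodes it points to.
new_hubs = {node: 0.0 for node in hubs}
for source, target in edges:
    new_hubs[source] += authorities[target]
new_hubs = normalize(new_hubs)

print("Authority scores:", authorities)
print("Hub scores after one more round:", new_hubs)
```

Because one full round of updates returns the same hub scores, these values are a fixed point of the iteration for this graph.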

Applications of the HITS Algorithm

The HITS algorithm has various applications, especially in scenarios where understanding the relationship between nodes is important −

  • Web Search: Ranking web pages by their authority and hub scores to provide better search results.
  • Social Network Analysis: Finding important people (hubs) and trusted sources (authorities) in a network.
  • Recommendation Systems: Suggesting products or content based on hub-like and authority-like behaviors of users and items.
  • Biological Network Analysis: Finding important genes or proteins in a network of interactions.

Comparison of HITS and PageRank

Both HITS and PageRank are link analysis algorithms, but they differ in several key aspects −

Aspect | PageRank | HITS
Focus | Assigns each node a single rank that reflects its overall importance in the link structure. | Splits node importance into two scores: a hub score based on outbound links and an authority score based on inbound links.
Graph Type | Defined on directed graphs; can also be applied to undirected graphs by treating each edge as a link in both directions. | Intended for directed graphs; on an undirected graph the hub and authority scores coincide.
Best Applications | Ranking web pages by overall importance. | Topic discovery and web search, where both hub-like and authority-like qualities are relevant.
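
To make the "single rank" column concrete, the sketch below runs a basic PageRank power iteration on the same five-edge example graph used earlier. It is a simplified illustration under stated assumptions: the function name `pagerank_scores`, the damping factor of 0.85, and the fixed iteration count are choices made for this example, and the code assumes every node has at least one outgoing edge, which holds for this graph.

```python
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "A")]
nodes = sorted({node for edge in edges for node in edge})
out_links = {n: [t for s, t in edges if s == n] for n in nodes}


def pagerank_scores(damping=0.85, iterations=100):
    """Basic PageRank power iteration; assumes no node lacks outgoing edges."""
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives a small baseline share of rank.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        # Each node splits its damped rank evenly among the nodes it links to.
        for source in nodes:
            share = damping * rank[source] / len(out_links[source])
            for target in out_links[source]:
                new_rank[target] += share
        rank = new_rank
    return rank


ranks = pagerank_scores()
print("PageRank scores:", ranks)
```

Each node gets exactly one positive score here, including A and D. In the HITS output shown earlier, A and D receive authority scores that are effectively zero, while the hub and authority roles of the other nodes are reported separately.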

Challenges in the HITS Algorithm

While HITS is a powerful algorithm, it comes with several challenges −

  • Computational Complexity: HITS is iterative, and each iteration processes every edge in the graph, so the cost grows with both the size of the graph and the number of iterations needed to converge.
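  • Handling Dangling Nodes: A node with no outgoing edges always receives a hub score of 0, and a node with no incoming edges always receives an authority score of 0, so sparse or poorly connected graphs can produce many zero scores that say little about those nodes.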
  • Topic Drift: In some cases, a densely interlinked group of nodes that is only loosely related to the topic of interest can dominate the scores, so the top hubs and authorities drift away from the original topic.