How can we use hub pages to find authoritative pages?

A hub is one or more Web pages that provide collections of links to authorities. Hub pages may not be prominent themselves, and few links may point to them; however, they provide links to a collection of prominent sites on a common topic.

Such pages can be lists of recommended links on individual home pages, such as recommended reference sites on a course home page, or professionally assembled resource lists on commercial sites. Hub pages play the essential role of implicitly conferring authority on pages about a focused topic.

In general, a good hub is a page that points to many good authorities; a good authority is a page pointed to by many good hubs. This mutual reinforcement relationship between hubs and authorities supports the mining of authoritative Web pages and the automated discovery of high-quality Web structures and resources.
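The mutual reinforcement can be written as a pair of update rules. The notation here is introduced only for illustration and is not taken from the text above: a(p) is the authority weight of page p, h(p) is its hub weight, and E is the set of links (q, p) meaning that page q points to page p.

```latex
a(p) = \sum_{q \,:\, (q,p) \in E} h(q), \qquad
h(p) = \sum_{q \,:\, (p,q) \in E} a(q)
```

A page's authority weight grows with the hub weights of the pages pointing to it, and its hub weight grows with the authority weights of the pages it points to.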

An algorithm that uses hubs, known as HITS (Hyperlink-Induced Topic Search), works as follows. First, HITS uses the query terms to collect a starting set of, say, 200 pages from an index-based search engine. These pages form the core set.

Because many of these pages are presumably relevant to the search topic, some of them should contain links to most of the prominent authorities. Hence, the core set can be expanded into a base set by including the pages that the core-set pages link to and the pages that link to a page in the core set, up to a designated size cut-off such as 1,000 to 5,000 pages (to be included in the base set).
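The expansion step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name expand_to_base_set, the dictionaries out_links and in_links (page to list of linked pages), and the max_size parameter are all hypothetical, and a real system would obtain the link data from a crawler or a search engine's link index.

```python
def expand_to_base_set(core, out_links, in_links, max_size=1000):
    """Grow the core set into a base set, stopping at max_size pages.

    core      -- list of page identifiers returned by the search engine
    out_links -- dict: page -> pages it links to (assumed to be available)
    in_links  -- dict: page -> pages that link to it (assumed to be available)
    """
    base = list(dict.fromkeys(core))  # keep order, drop duplicates
    seen = set(base)
    for page in list(base):  # iterate over the core pages only
        neighbours = list(out_links.get(page, [])) + list(in_links.get(page, []))
        for n in neighbours:
            if len(base) >= max_size:  # size cut-off reached
                return base
            if n not in seen:
                seen.add(n)
                base.append(n)
    return base
```

Only the direct neighbours of core-set pages are added, so the base set stays within one link of the pages the search engine returned.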

Second, a weight-propagation phase is initiated. This iterative process determines numerical estimates of hub and authority weights. Note that links between two pages within the same Web domain (i.e., sharing the same first level in their URLs) often serve a navigational function and therefore do not confer authority. Such links are excluded from the weight-propagation analysis.
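A minimal sketch of the weight-propagation phase is shown here. Several details are assumptions made for illustration rather than part of the description above: "same Web domain" is approximated by comparing the host part of the URL, weights start at 1.0, each round is followed by normalization to unit length, and the iteration count is fixed at 20. The names hits, same_domain, and normalize are hypothetical.

```python
import math
from urllib.parse import urlparse


def same_domain(u, v):
    """Assumption: two URLs share a domain when their host names match."""
    return urlparse(u).netloc == urlparse(v).netloc


def normalize(weights):
    """Scale weights to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {p: w / norm for p, w in weights.items()}


def hits(links, iterations=20):
    """links: (source_url, target_url) pairs among base-set pages."""
    links = list(links)
    pages = {p for edge in links for p in edge}
    # Drop intra-domain links: they are navigational, not endorsements.
    edges = [(s, t) for s, t in links if not same_domain(s, t)]
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of pages pointing to it.
        new_auth = {p: 0.0 for p in pages}
        for s, t in edges:
            new_auth[t] += hub[s]
        # Hub weight: sum of authority weights of pages it points to.
        new_hub = {p: 0.0 for p in pages}
        for s, t in edges:
            new_hub[s] += new_auth[t]
        auth, hub = normalize(new_auth), normalize(new_hub)
    return hub, auth
```

Sorting pages by the returned authority weights gives candidate authorities, and sorting by hub weights gives candidate hubs. A page that is reached only through links from its own domain ends up with an authority weight of zero in this sketch.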

Google's PageRank algorithm is based on a similar principle. Systems that analyze Web links and textual context information have been reported to achieve better-quality search results than both human-built directories, such as the one created by ontologists at Yahoo!, and term-index engines such as AltaVista.

Link analysis algorithms are based on the following two assumptions. First, links convey human endorsement. If there is a link from page A to page B and the two pages are authored by different people, then the link implies that the author of page A found page B valuable. Therefore, the importance of a page can be propagated to the pages it links to. Second, pages that are co-cited by a certain page are likely to be related to the same topic.
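The second assumption can be illustrated by counting co-citations, that is, how many pages link to both members of a pair. This is a small sketch rather than a full relatedness measure, and the function name cocitation_counts and the out_links dictionary (page to list of linked pages) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(out_links):
    """Count, for each pair of pages, how many pages link to both."""
    counts = Counter()
    for source, targets in out_links.items():
        # Every pair of distinct targets on one page is co-cited once.
        for pair in combinations(sorted(set(targets)), 2):
            counts[pair] += 1
    return counts
```

Pairs with high counts are cited together by many pages, which under the assumption above suggests that they cover the same topic.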

Updated on 17-Feb-2022 12:32:25