What is the PageRank algorithm in web mining?


PageRank is a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. Web search engines have to cope with inexperienced users and with pages engineered to manipulate conventional ranking functions. Evaluation methods that simply count replicable features of Web pages are vulnerable to such manipulation.

The task is to take advantage of the hyperlink structure of the Web to produce a global importance ranking of every Web page. This ranking is called PageRank.

The Web can be viewed as a graph with about 150 million nodes (Web pages) and 1.7 billion edges (hyperlinks). If Web pages A and B link to page C, A and B are called the backlinks of C. In general, highly linked pages are more important than pages with only a few links, but not all backlinks count equally: a few backlinks from important pages can matter more than many backlinks from unimportant ones.

For instance, a Web page with a single backlink from Yahoo should be ranked higher than a page with multiple backlinks from obscure or private sites. A Web page has a high rank if the sum of the ranks of its backlinks is high.

The following is a simplified version of PageRank. Let u and v be Web pages, let Bu be the set of pages that point to u, and let Nv be the number of links from page v. Let c < 1 be a factor used for normalization. This defines a simple ranking R, which is a simplified version of PageRank:

$$\mathrm{R(u)\:=\:c\displaystyle\sum\limits_{v\in{B_u}}\frac{R(v)}{N_v}}$$

The rank of a page is divided evenly among its forward links, and each share contributes to the rank of the page that link points to. The equation is recursive, but it can be computed by starting from any set of ranks and iterating until the values converge.
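
To make the recursive definition concrete, here is a minimal Python sketch of the simplified ranking. The three-page graph, and the choice to treat c as a per-pass normalization that keeps the ranks summing to 1, are assumptions made for the example, not data from the article.

```python
# A rough sketch of the simplified ranking on a made-up three-page graph
# (the pages and links are illustrative assumptions, not data from the article).
links = {            # page -> the pages it links to (its forward links)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

# Derive Bu (the backlinks of u) and Nv (the number of forward links of v).
backlinks = {u: [v for v, targets in links.items() if u in targets] for u in links}
n_out = {v: len(targets) for v, targets in links.items()}

# Start from a uniform assignment and apply the recursive definition repeatedly,
# treating c as the factor that normalizes the ranks so they always sum to 1.
rank = {page: 1.0 / len(links) for page in links}
for _ in range(50):
    sums = {u: sum(rank[v] / n_out[v] for v in backlinks[u]) for u in links}
    c = 1.0 / sum(sums.values())
    rank = {u: c * s for u, s in sums.items()}

for page, value in sorted(rank.items(), key=lambda item: -item[1]):
    print(f"{page}: {value:.4f}")
```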

There is, however, a problem with this simplified function. Consider two Web pages that point only to each other, while some other Web page points to one of them. During the iteration this loop keeps receiving rank but never distributes any of it back to the rest of the graph, so it accumulates more and more rank. The trap formed by such a loop with no outgoing edges is known as a rank sink.
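
The rank sink can be seen on a small made-up graph: pages 1 and 2 point only to each other, and page 3 points into the loop. The sketch below (normalization omitted so the flow of rank is easy to follow) is purely illustrative.

```python
# Illustrative rank-sink demo: pages 1 and 2 link only to each other,
# page 3 links into the loop, and nothing links back out of it.
links = {
    1: [2],   # page 1 points only to page 2
    2: [1],   # page 2 points only to page 1
    3: [1],   # page 3 points into the loop; no page points back to it
}

backlinks = {u: [v for v, targets in links.items() if u in targets] for u in links}
n_out = {v: len(targets) for v, targets in links.items()}

rank = {p: 1.0 / len(links) for p in links}
for step in range(1, 11):
    rank = {u: sum(rank[v] / n_out[v] for v in backlinks[u]) for u in links}
    loop_total = rank[1] + rank[2]
    print(f"step {step:2d}: pages 1+2 hold {loop_total:.4f}, page 3 holds {rank[3]:.4f}")

# Page 3's rank drains into the {1, 2} loop after the first pass and is never
# returned: the loop accumulates all of the rank and shares none of it.
```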

The PageRank algorithm begins by converting every URL in the database into a unique integer ID. The next phase is to store each hyperlink in a database using these integer IDs to identify the Web pages. The iteration is started after sorting the link structure by parent ID and removing dangling links (links whose target page has no outgoing links of its own).
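
A rough sketch of these preprocessing steps might look like the following; the URLs and the in-memory data structures are hypothetical stand-ins for the crawl database the article refers to.

```python
# Hypothetical crawl data standing in for the URL/link database.
raw_links = [
    ("http://example.org/a", "http://example.org/b"),
    ("http://example.org/b", "http://example.org/c"),
    ("http://example.org/c", "http://example.org/a"),
    ("http://example.org/b", "http://example.org/d"),   # /d has no outgoing links
]

# 1. Convert every URL into an integer ID.
all_urls = sorted({url for pair in raw_links for url in pair})
url_id = {url: i for i, url in enumerate(all_urls)}

# 2. Store each hyperlink as a (parent_id, child_id) pair of integer IDs.
edges = [(url_id[src], url_id[dst]) for src, dst in raw_links]

# 3. Sort the link structure by parent ID.
edges.sort()

# 4. Remove dangling links, i.e. links whose target has no outgoing links of its own.
pages_with_outlinks = {parent for parent, _ in edges}
edges = [(p, c) for p, c in edges if c in pages_with_outlinks]

print(url_id)
print(edges)   # ready for the iteration phase
```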

A good initial assignment should be chosen to speed up convergence. The weights from the current step are kept in memory while the weights from the previous step are read from disk in linear time. After the weights have converged, the dangling links are added back in and the rankings are recalculated. The computation performs reasonably well as is, but it can be made faster by loosening the convergence criteria and using more efficient optimization techniques.
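
As a loose illustration of this final phase, the sketch below keeps the current weights in a dictionary in memory, stops the main loop once the total change per pass falls below a threshold, and then adds the dangling links back for a couple of extra passes. The helper function, graph, and threshold are assumptions for the example rather than the implementation the article describes.

```python
def iterate(edges, rank, passes=None, epsilon=1e-8):
    """Apply R(u) = c * sum(R(v) / Nv) repeatedly, normalizing ranks to sum to 1."""
    pages = sorted({p for edge in edges for p in edge} | set(rank))
    backlinks = {u: [v for v, w in edges if w == u] for u in pages}
    n_out = {v: sum(1 for src, _ in edges if src == v) for v in pages}
    rank = {p: rank.get(p, 0.0) for p in pages}
    step = 0
    while passes is None or step < passes:
        sums = {u: sum(rank[v] / n_out[v] for v in backlinks[u]) for u in pages}
        total = sum(sums.values()) or 1.0
        new_rank = {u: s / total for u, s in sums.items()}   # normalize each pass
        delta = sum(abs(new_rank[u] - rank[u]) for u in pages)
        rank, step = new_rank, step + 1
        if passes is None and delta < epsilon:   # convergence criterion; loosening it
            break                                # trades a little accuracy for speed
    return rank

core_edges = [(0, 1), (1, 2), (2, 0), (2, 1)]    # links among non-dangling pages
dangling_edges = [(1, 3)]                        # link to page 3, which has no outlinks

rank = iterate(core_edges, {p: 0.25 for p in range(4)})          # main iteration
rank = iterate(core_edges + dangling_edges, rank, passes=2)      # re-add dangling links
print({p: round(r, 4) for p, r in rank.items()})
```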
