049 What is the Core Idea of the PageRank Algorithm #

Last week, we introduced the historical development of information retrieval systems, analyzed the multi-round scoring system of search systems, and delved into the core technology of inverted indexing.

This week, I want to share with you a research problem that emerged with the rise of internet search engines: how to understand the relationships between web pages, and in particular how to extract features other than text from those relationships. The core algorithms in this area have been important driving forces behind improvements in search engine quality. The algorithm we will cover this week also applies to other information networks that can be expressed as relationships between nodes.

Today, let’s take a look at using graphs to represent the relationships between web pages and the classic algorithm for calculating web page importance: PageRank.

A Brief History of PageRank #

Today, Sergey Brin and Larry Page, the founders of Google, are well-known figures in the tech industry. However, back in 1995, they were both doctoral students in Stanford University’s computer science department. At that time, the internet was still in its early stages, and Yahoo had emerged as the first giant of the information age. Brin and Page both wanted to create their own search engine. In the summer of 1998, they temporarily left their doctoral program at Stanford to devote themselves full-time to the development of Google. They presented a summary of the entire project at the Seventh International World Wide Web Conference (WWW7) in 1998 (see reference [1]); this was the first complete description of the PageRank algorithm.

PageRank generated significant interest in academic circles, leading to various modifications, interpretations, and analyses of the algorithm. For a deeper understanding, I recommend reading reference [2].

Basic Principles of PageRank #

Let me first introduce the most basic form of PageRank, which was the original idea behind Google’s PageRank algorithm proposed by Brin and Page.

First, let’s look at the structure of each webpage. Each webpage has a collection of “outlinks”. Here, outlinks refer to the links from the current webpage to other pages. For example, if there is a link from page A to page B, then B is an outlink of A. Similarly, we can define “inlinks”, which are links from other pages that point to the current page. For example, if page C points to page A, then C is an inlink of A.

With the concept of inlinks and outlinks, let’s define the PageRank of a webpage. We assume that each webpage has a value called PageRank, which measures the importance of the page. This value is defined as the weighted sum of the PageRank values of all its inlinks.

So, what is the weight? For an inlink J of page I, suppose J has N outlinks of its own. Then the weight is one over N; in other words, J distributes one N-th of its own PageRank to I. In this sense, the PageRank of I is the sum of the PageRank values of its inlinks, each divided by that inlink’s number of outlinks. A page with many outlinks passes only a small share of its PageRank through each individual link, while a page with few outlinks passes a larger share through each one. This is a very intuitive definition.
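In symbols (using notation not in the original text: In(I) for the set of inlinks of page I, and N_J for the number of outlinks of page J), this definition can be written as:

```latex
PR(I) = \sum_{J \in \mathrm{In}(I)} \frac{PR(J)}{N_J}
```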

However, this definition alone is not enough because, in this definition, the PageRank values of pages I and J, as well as any other pages, are unknown. We have unknowns on both sides of the equation, which seems like an unsolvable problem.

Brin and Page used an iterative algorithm in their paper. This algorithm is quite straightforward: since we don’t know the PageRank values, let’s assign them a set of initial values, where all pages have the same PageRank value. Then, based on the definition mentioned above, we update the PageRank values of all pages. We repeat this process until the PageRank values of all pages no longer change significantly, or until they converge to a fixed value. They showed in their paper that the PageRank values often converge after a relatively small number of iterations.
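The iteration described above can be sketched in a few lines of Python. The three-page link graph here is purely illustrative, not from the original paper:

```python
# A minimal sketch of the basic PageRank iteration (no dangling-node or
# random-jump fixes yet). The tiny link graph is purely illustrative.
outlinks = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

# Start every page with the same PageRank value.
pr = {page: 1.0 / len(outlinks) for page in outlinks}

for _ in range(50):  # iterate until the values stop changing noticeably
    new_pr = {page: 0.0 for page in outlinks}
    for page, outs in outlinks.items():
        share = pr[page] / len(outs)  # each outlink receives 1/N of the PageRank
        for target in outs:
            new_pr[target] += share
    pr = new_pr

print({page: round(value, 3) for page, value in pr.items()})
# → {'A': 0.4, 'B': 0.2, 'C': 0.4}
```

Because the update rule is just the definition restated, the values settle quickly on this tiny graph, consistent with the paper’s observation that relatively few iterations suffice.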

This is the basic principle of PageRank, together with an iterative algorithm for computing it.

Improvement of the PageRank Algorithm #

By following the original PageRank algorithm we introduced earlier, Brin and Page quickly encountered problems.

The first problem is that some pages have no outgoing links, such as certain PDF files or image files. Without outgoing links, these pages can only accumulate PageRank distributed from their inlinks but never distribute any PageRank of their own. Such pages become “dangling” nodes. The biggest problem with dangling nodes is that they prevent the PageRank computation from converging properly: they act as “black holes” that keep absorbing PageRank flowing in through their inlinks without ever passing any of it back into the network, eventually “draining” value away from all other pages.

To solve this problem, the dangling nodes must themselves be “drained” and their values redistributed. Brin and Page found a method for this: for each dangling node, assume that the node can randomly reach any other node in the entire network. This is equivalent to artificially connecting the dangling node to every page, so that the PageRank of the dangling node is distributed “evenly” to all other nodes. This solves the dangling-node problem.
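As a sketch, this fix can be folded into the iteration by treating a dangling node as if it linked to every page. The graph below is illustrative; “D” plays the role of a dangling page such as a PDF:

```python
# Sketch of the dangling-node fix: a page with no outlinks spreads its
# PageRank evenly across every page. The graph is illustrative.
outlinks = {
    "A": ["B", "D"],
    "B": ["A"],
    "D": [],          # dangling: accumulates PageRank but has no outlinks
}
pages = list(outlinks)
n = len(pages)
pr = {p: 1.0 / n for p in pages}

for _ in range(100):
    new_pr = {p: 0.0 for p in pages}
    for page, outs in outlinks.items():
        if outs:
            share = pr[page] / len(outs)
            for target in outs:
                new_pr[target] += share
        else:
            # treat the dangling node as linking to every page equally
            for target in pages:
                new_pr[target] += pr[page] / n
    pr = new_pr
```

With the fix in place, the total PageRank in the system stays constant from one iteration to the next, so the iteration can converge.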

However, the original PageRank algorithm still has other problems. To guarantee that PageRank converges, and converges to a unique solution, a second improvement is needed: even when a page has natural outgoing links, we also add a mechanism for moving from this page to any other page. This simulates a user who has browsed to a certain page: on the one hand, the user can continue browsing through the outgoing links on the page; on the other hand, the user can randomly jump to any other page.

With this mechanism, the allocation of PageRank naturally changes for all nodes. Under the previous definition, each page transferred its PageRank value only to its original outgoing links. Now that transfer is just one part of the “sharing”; the other part is sharing its PageRank value with all pages. Of course, the total amount of the latter should be smaller than the former, so a parameter is introduced to control the proportion of following outgoing links versus jumping to arbitrary pages. In practice, this parameter is typically set between 60% and 85%; the classic choice in the original paper is 85%.
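Putting both improvements together, one possible sketch of the full update looks like this (the three-page graph, and calling the parameter alpha, are my own illustrative choices):

```python
# Sketch of the full scheme: with probability alpha, PageRank flows along
# real outlinks; with probability 1 - alpha, it is spread uniformly over
# all pages (the "random jump"). Dangling pages spread everything evenly.
alpha = 0.85  # proportion of PageRank passed along real outlinks

outlinks = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}  # illustrative graph
pages = list(outlinks)
n = len(pages)
pr = {p: 1.0 / n for p in pages}

for _ in range(100):
    new_pr = {p: (1 - alpha) / n for p in pages}  # random-jump share
    for page, outs in outlinks.items():
        if outs:
            for target in outs:
                new_pr[target] += alpha * pr[page] / len(outs)
        else:  # dangling node: all of its PageRank is spread evenly
            for target in pages:
                new_pr[target] += alpha * pr[page] / n
    pr = new_pr
```

Note that the total PageRank remains 1 after every iteration: the random-jump share contributes exactly 1 - alpha, and the link-following share contributes exactly alpha.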

With these two improvements, every page on the network can reach any other page with nonzero probability. In other words, the whole page network effectively becomes a fully connected graph, and the PageRank algorithm has a unique convergent solution.

PageRank Analysis #

Shortly after the proposal of PageRank, scholars began to analyze the properties of the PageRank model and algorithm. It was quickly discovered that there are other methods to explain PageRank.

The first popular and more formal way to explain PageRank is to write the distribution equation above in matrix form. The entire algorithm then becomes the process of solving for the “left eigenvector” of a stochastic matrix, namely the modified transition matrix we mentioned earlier. And the iterative method described above is exactly the “power method” for computing the dominant eigenvector. Under certain conditions on the stochastic matrix, the power method is guaranteed to converge to a unique solution.
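As a sketch of this matrix view, one can build the modified transition matrix (often called the Google matrix) for a tiny illustrative graph and run the power method with NumPy (assumed available here):

```python
import numpy as np

# Illustrative graph: A -> B; B -> A, C; C -> A, with damping alpha = 0.85.
alpha = 0.85
n = 3
# Row-stochastic matrix S: S[i, j] = probability of following a link i -> j.
S = np.array([
    [0.0, 1.0, 0.0],   # A -> B
    [0.5, 0.0, 0.5],   # B -> A, B -> C
    [1.0, 0.0, 0.0],   # C -> A
])
# Modified transition matrix: follow links with prob. alpha, jump uniformly
# with prob. 1 - alpha.
G = alpha * S + (1 - alpha) / n * np.ones((n, n))

# Power method: repeated left-multiplication converges to the dominant
# left eigenvector of G, which is the PageRank vector.
v = np.full(n, 1.0 / n)
for _ in range(100):
    v = v @ G
print(v.round(3))
```

The converged vector satisfies v @ G ≈ v, which is precisely the left-eigenvector equation with eigenvalue 1.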

Another explanation comes from algebraically transforming the matrix form we just mentioned: move every term to one side of the equation, so that the other side naturally becomes 0. The whole problem then becomes solving a “linear system”. In other words, the entire PageRank computation can be explained from an algebraic perspective.
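A sketch of this algebraic view for the same kind of tiny illustrative graph: rearranging the damped PageRank equation (and using the fact that the PageRank values sum to 1) yields the linear system (I - alpha * S^T) x = ((1 - alpha) / n) * 1, which a direct solver handles without any iteration (NumPy assumed available):

```python
import numpy as np

# Solve PageRank as a linear system instead of iterating.
alpha = 0.85
n = 3
S = np.array([          # illustrative graph: A -> B; B -> A, C; C -> A
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0],
])
# Rearranged equation: (I - alpha * S^T) x = ((1 - alpha) / n) * 1
A = np.eye(n) - alpha * S.T
b = np.full(n, (1 - alpha) / n)
x = np.linalg.solve(A, b)
print(x.round(3))
```

The solution automatically sums to 1 and agrees with what the power method converges to, since both are solving the same underlying equation.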

Summary #

Today, I discussed an important branch of modern search technology, link analysis, focusing on the core idea of the PageRank algorithm. Let’s review the key points: first, we covered the brief history of PageRank and the algorithm’s original definition and approach; second, we discussed two improvements to PageRank; third, we briefly introduced two interpretations of PageRank.

Lastly, I leave you with a question to ponder: besides the power iteration method, what other methods do you think can be used to compute PageRank values?

References #

  1. Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Proceedings of the seventh international conference on World Wide Web 7 (WWW7), Philip H. Enslow, Jr. and Allen Ellis (Eds.). Elsevier Science Publishers B. V., Amsterdam, The Netherlands, 107-117, 1998.
  2. Langville, Amy N. and Meyer, Carl D. Deeper Inside PageRank. Internet Mathematics, vol. 1, no. 3, 335-380, 2004.
