050 Classical Graph Algorithms Hits

050 Classical Graph Algorithms - HITS #

This week, we are sharing content on understanding the relationships between web pages. On Monday, we introduced how to express the relationships between web pages using graphs and calculate the importance of web pages using the classic algorithm, PageRank. Today, I will introduce the sister algorithm of PageRank: the HITS algorithm.

A Brief History of HITS #

HITS, an abbreviation for Hypertext-Induced Topic Search, is an algorithm invented by Jon Kleinberg, a computer science professor at Cornell University, in 1998. Interestingly, this was the same year that Larry Page and Sergey Brin published the PageRank algorithm, as we discussed on Monday.

First, let’s take a brief look at Jon Kleinberg himself. He was born in Boston, Massachusetts in 1971. In 1993, he graduated from Cornell University with a bachelor’s degree in computer science, and in 1996, he obtained his Ph.D. in computer science from the Massachusetts Institute of Technology. In 1998, Jon was doing postdoctoral research at IBM Almaden Research Center in Silicon Valley, United States, when he published the work on HITS. The publication took place at the 9th ACM-SIAM Symposium on Discrete Algorithms, held in San Francisco in 1998 (for more detailed discussion, please refer to the references).

Currently, Jon is a member of the National Academy of Engineering and the American Academy of Arts and Sciences. By the way, it’s worth mentioning that Jon’s younger brother, Robert Kleinberg, is also a faculty member in the computer science department at Cornell University.

Basic Principles of HITS #

Before introducing the basic principles of the HITS algorithm, let’s first review the network structure of web pages. Each web page has a set of “outlinks,” which refers to the pages that the current page links to. For example, if page A has a link to page B, then B is an outlink of A. Based on this definition, we can also define “inlinks” as the pages that link to the current page. For example, if page C links to page A, then C is an inlink of A.

To understand the HITS algorithm, we need to introduce the concepts of “authority” and “hub” nodes. So what do these two types of nodes mean?

HITS provides a cyclic definition: a good “authority” node has many “hub” nodes as its outlinks, while a good “hub” node points to many good “authority” nodes. We have already seen this kind of cyclic definition in the definition of PageRank.

Obviously, in order to mathematically express the relationship between authority and hub nodes, we need to prepare two values for each page. Intuitively, no page can be completely authoritative or completely a hub. Most pages transition between these two roles or play both roles at the same time.

Mathematically, for each page I, we use X to express the “authority value” of the page and Y to express the “hub value” of the page. The most intuitive definition for the authority value X of I is the sum of the hub values of all pages that have inlinks to I. Similarly, the hub value of I is the sum of the authority values of all pages that have outlinks from I. This is the original definition of the HITS algorithm.

As we can see, if the hub value of a page I is large, it means that page I is frequently linked to by good hub nodes, and therefore its own authority increases. Conversely, if page I frequently points to good authority nodes, its hub nature becomes important.

Of course, just like the PageRank value, X and Y in the HITS algorithm are also unknown in advance. Therefore, the key point of the HITS algorithm is to solve for X and Y. If we express X and Y for all pages in the form of vectors, the HITS algorithm can be written as X is the transpose of the matrix L multiplied by Y, and Y is the matrix L multiplied by X. Here, the matrix L is an adjacency matrix where each row and column represents whether two pages are connected. By algebraic manipulation, we can derive that X is actually a matrix A multiplied by X, where A is the transpose of L multiplied by L. Y is actually a matrix B multiplied by Y, where B is L multiplied by the transpose of L.

Therefore, there emerges an astonishing point: the HITS algorithm actually requires solving the principal eigenvector of matrix A or matrix B, which corresponds to the largest eigenvalue, to solve for X or Y. This is similar to the matrix representation used in PageRank. This means that although PageRank and HITS are completely different in terms of ideas and concepts, and initially have opposite definitions, after some transformation, they can both be regarded as some form of matrix eigenvalue problem.

In fact, representing a graph as a matrix and analyzing some characteristics of the graph through eigenvectors is an important branch in graph algorithms (of course, here we mainly refer to the eigenvector corresponding to the largest value, while other eigenvectors also have meanings). Since we already know that we need to compute the largest eigenvector, the “power method” used for calculating PageRank can also be used here, but we won’t go into detail here.

So how do we use the HITS algorithm for search? Here is how HITS was initially used.

First, we construct a “neighborhood graph” based on a query keyword. This graph includes all pages related to the query keyword. Here, we can simplify it as all pages containing the query keyword. This step can be easily achieved in modern search engines with the help of an inverted index.

Once we have this neighborhood graph, we can build an adjacency matrix based on this graph, and then calculate the authority value and hub value of these nodes through the adjacency matrix. After calculating these two sets of values, we can present two kinds of webpage ranking results to users based on different assumptions.

It is worth noting that PageRank is a “query-independent” algorithm, which means that the PageRank value of each page does not vary with different query keywords. On the other hand, HITS is a “query-dependent” algorithm. From this perspective, HITS and PageRank are fundamentally different.

Some characteristics of the HITS algorithm #

The HITS algorithm relies on iterative methods to calculate authority and hub values. You must be curious whether such calculations converge or need special handling like PageRank.

The answer is that HITS is guaranteed to converge. This is better than the original PageRank situation. However, in the original case, HITS does not necessarily converge to a unique set of authority and hub values, meaning that the solution is not unique. Therefore, we need to apply a similar treatment to HITS as we do with PageRank, which ensures that all nodes in the adjacency matrix of HITS can reach any other node, but with a very small probability. With this modification, HITS can converge to unique authority and hub values.

The advantage of the HITS algorithm is that it provides users with a new perspective. For the same query keyword, the authority ranking and hub ranking provided by HITS can help users understand their own needs.

Of course, the weakness of HITS also comes from its dependence on query keywords. If all calculations are deferred until the user enters the query keywords and the system needs to calculate and sort all authority and hub values within a reasonable response time, the computational complexity will be significant. Therefore, some researchers have started using a global web graph to pre-calculate authority and hub values for all pages. However, this approach loses the specific information about a particular keyword.

Summary #

Today, I have discussed the core ideas of the HITS algorithm with you. Let’s recap the key points: firstly, we talked about some brief history of HITS. Secondly, we discussed the original definition and algorithm of HITS, and compared it with PageRank, explaining their similarities and differences. Thirdly, we analyzed some characteristics of HITS.

In conclusion, I will leave you with a question to ponder: is there a way to merge the two rankings corresponding to authority values and hub values into one ranking?

References

  1. Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM 46, 5 (September 1999), 604-632,1999.

Paper Link