Sunday, February 9, 2014

Fast and Efficient Computation of PageRank


Team Members

Introduction

Nowadays everyone uses search engines and hence their underlying algorithms. One of the most popular search engines is Google. Along with other algorithms, Google uses PageRank to determine the importance of a website. The computation of the PageRank for billions of websites can be done quickly and efficiently via parallel computing.

The PageRank formula

The PageRank algorithm weighs websites according to their popularity. The more incoming links from other domains a web page has, the more popular it is and the higher its PageRank. Let p_i be a website and M(p_i) the set of websites linking to it; then the PageRank PR(p_i) is calculated from the sum \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{l(p_j)} over all websites linking to p_i, where each PR(p_j) is divided by l(p_j), the number of outgoing links of p_j to other domains. To prevent websites with no outgoing links from gaining more PageRank than necessary, a damping factor d is added to the equation (in general it turns out that d = 0.85 is a good damping factor). This leads to the following equation, where N is the number of websites:


PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{l(p_j)}
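
As a small numerical check with made-up values: suppose N = 4 and p_1 is linked from p_2, which has l(p_2) = 2 outgoing links, and from p_3, which has l(p_3) = 1. If currently PR(p_2) = PR(p_3) = 0.25, then

PR(p_1) = \frac{1-0.85}{4} + 0.85 \left( \frac{0.25}{2} + \frac{0.25}{1} \right) = 0.0375 + 0.31875 = 0.35625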

Another, more mathematical way to look at the PageRank is to see it as a probability distribution which represents the likelihood that a person randomly surfing the web reaches the current page p_i.
We can rearrange the equation into matrix form and see that the PageRank values are the entries of an eigenvector:

R = \begin{pmatrix} PR(p_1) \\ PR(p_2) \\ \vdots \\ PR(p_N) \end{pmatrix}

with R as the solution of


R = \begin{pmatrix} (1-d)/N \\ (1-d)/N \\ \vdots \\ (1-d)/N \end{pmatrix} + d \begin{pmatrix} l(p_1,p_1) & l(p_1,p_2) & \dots & l(p_1,p_N) \\ l(p_2,p_1) & \ddots & & l(p_2,p_N) \\ \vdots & & l(p_i,p_j) & \\ l(p_N,p_1) & \dots & & l(p_N,p_N) \end{pmatrix} R

Here the entry l(p_i,p_j) equals 1/l(p_j) if p_j links to p_i and 0 otherwise. Because not every website links to every other website, the big l(p_i,p_j)-matrix is a sparse matrix.
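
For illustration, take a made-up web of N = 3 pages where p_1 links to p_2 and p_3, p_2 links to p_3, and p_3 links to p_1. Then l(p_1) = 2, l(p_2) = l(p_3) = 1, and the link matrix is

\begin{pmatrix} 0 & 0 & 1 \\ 1/2 & 0 & 0 \\ 1/2 & 1 & 0 \end{pmatrix}

Even in this tiny example most entries are zero; on the scale of the web the fraction of non-zero entries is vanishingly small.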

The PageRank computation

The web contains billions of websites with billions of links, which are changing every day. PageRank computation is an excellent application for parallel computing, because the PageRank has to be calculated very often and very fast on a huge set of data. Nowadays Google also allows you to choose preferred sites, which probably gain more PageRank. The computation of PageRank can be done in different ways (at its core it is solving a linear-algebra eigenvector equation). We can solve the equation iteratively, e.g. with the Jacobi method, until it converges.

Iterative computation

As the initial probability distribution at t = 0 we assume


PR(p_i,0) = 1/N

and can compute

PR(p_i,t+1) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j,t)}{l(p_j)}.

The computation ends when, for some small \epsilon, the equation

|R(t+1) - R(t)| < \epsilon

is fulfilled. Depending on \epsilon, the computation usually needs 50-100 iterations.
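
To make the iteration concrete, here is a minimal sketch in Python (the function name and the adjacency-list representation are our own choices): it repeats the update rule above until the change between iterations drops below \epsilon. Pages without outgoing links are simply skipped, which is a simplification of how dangling nodes are usually handled.

import numpy as np

def pagerank(links, d=0.85, eps=1e-8, max_iter=100):
    """Iterative PageRank; links[j] lists the pages that page j links to."""
    n = len(links)
    pr = np.full(n, 1.0 / n)                  # PR(p_i, 0) = 1/N
    for _ in range(max_iter):
        new = np.full(n, (1.0 - d) / n)       # the (1-d)/N term
        for j, out in enumerate(links):
            if out:                           # p_j spreads d * PR(p_j)/l(p_j) to each target
                share = d * pr[j] / len(out)
                for i in out:
                    new[i] += share
        if np.abs(new - pr).sum() < eps:      # |R(t+1) - R(t)| < epsilon
            return new
        pr = new
    return pr

# toy example: p0 -> p1, p2; p1 -> p2; p2 -> p0
print(pagerank([[1, 2], [2], [0]]))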

Parallel Approach

We plan to create a graph structure representing multiple websites, each with multiple links to other websites. Then we will split this graph among multiple processors, where each processor gets a chunk of the graph (a sub-graph). Each processor will do a PageRank computation on its sub-graph and send its partial results to the master processor. At the moment, we are still trying to determine how to communicate the requisite data values between processors, and how to do the final "housekeeping" with our master processor; one possible scheme is sketched below.
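
As a starting point, here is a minimal sketch of one such scheme in Python, assuming MPI via the mpi4py package (this is our own illustration, not a final design): each rank owns a block of source pages, computes the PageRank contributions of those pages, and the full vector is rebuilt every iteration with an all-reduce. The toy ring graph stands in for a real sub-graph.

# run with e.g.: mpirun -n 4 python pagerank_mpi.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

d, n = 0.85, 9
links = [[(i + 1) % n] for i in range(n)]          # toy ring graph; a real run would load a web graph
lo, hi = rank * n // size, (rank + 1) * n // size  # block of source pages owned by this rank

pr = np.full(n, 1.0 / n)
for _ in range(50):
    local = np.zeros(n)
    for j in range(lo, hi):                        # contributions of locally owned pages
        for i in links[j]:
            local[i] += d * pr[j] / len(links[j])
    total = np.zeros(n)
    comm.Allreduce(local, total, op=MPI.SUM)       # sum partial contributions across all ranks
    pr = (1.0 - d) / n + total

if rank == 0:
    print(pr)

Keeping a full copy of the PageRank vector on every rank is simple but memory-hungry; partitioning the vector itself would save memory at the price of more communication, which is exactly the trade-off we still have to settle.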
