The PageRank algorithm, created by Page and Brin, is the heart of the Google search engine. Google's dominance as a search engine is attributed to many factors, but the major factor in its success is the search algorithm it uses. Compared to other search engines, this algorithm helps to display the most important search results first. Page and Brin created the algorithm in 1998, while they were graduate students at Stanford University. It has been observed that people tend to pass judgment when someone browses website links through another search engine.
In other words, people perceive the Google engine as superior to others. The Google search engine is focused on mining intelligence, and, based on this concept, the PageRank algorithm was created to determine the importance of all web pages on the internet. The aim of this paper is to survey the PageRank algorithm and discuss how it is used to rank pages on the web.
Origin of the PageRank Algorithm
PageRank is one of the earliest techniques of link-based analysis for increasing the efficiency of information retrieval on the web. Larry Page and Sergey Brin, the Google founders, developed the PageRank algorithm at Stanford University in 1998 to serve as a system for ranking web pages based on the number of links they have. Page and Brin took a leave in August 1998, in order to fully cultivate the Meyer 26 (Hines, n.d.).
The name PageRank is a Google trademark, while the PageRank process itself is patented, with the patent assigned to Stanford University. Google holds exclusive license rights from Stanford University. During this project, Page and Brin applied the concept of link analysis to the web graph. PageRank ranks pages by observing how different links are interconnected (Hines, n.d.).
PageRank was influenced by Hyper Search, developed by Massimo Marchiori at the University of Padua, and by citation analysis, developed in the 1950s by Eugene Garfield at the University of Pennsylvania (Hines, n.d.). RankDex, a small search engine developed by Robin Li at IDD Information Services in 1996, was already using a similar strategy for scoring and ranking websites (Hines, n.d.).
Page and Brin developed the PageRank algorithm for Google based on the concept of link analysis, viewing a hyperlink as a recommendation and attaching more value to links that come from more credible sources. Page and Brin considered that the number and quality of references to a particular web document are related to its importance or quality. This means that a page does not gain value just by having more references; it is considered more valuable if it is referenced by other credible sources (Hines, n.d.).
Detailed Description of the Algorithm
The basic concept underlying the PageRank algorithm is to infer the relevance and value of a web page from just the topological structure of the directed graph associated with the World Wide Web (Batra & Sharma, 2013). Google's ranking algorithm is incredibly complex and is constantly being updated, which makes the exact algorithm difficult to analyze. The original algorithm is represented mathematically as follows (Batra & Sharma, 2013):
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
- PR(A) refers to the PageRank of page A;
- PR(T1), ..., PR(Tn) refer to the PageRank of pages T1 through Tn, which link to page A;
- C(T1) refers to the number of outbound links on page T1;
- d refers to the damping factor, which ranges between 0 and 1.
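The formula above can be evaluated by repeated substitution. The following is a minimal sketch, not Google's implementation: the function name `pagerank`, the starting value of 1 for every page, the default d = 0.85, the fixed number of iterations, and the three-page example graph are all assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to. Every page is
    assumed to have at least one outbound link, so C(T) is never zero.
    """
    pr = {page: 1.0 for page in links}  # assumed starting value for every page
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Sum PR(T) / C(T) over every page T that links to `page`.
            incoming = sum(pr[t] / len(out)
                           for t, out in links.items() if page in out)
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A
example = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(example)
for page, value in sorted(ranks.items()):
    print(page, round(value, 3))
```

In this form of the formula, the values sum to the number of pages rather than to 1, provided every page has at least one outbound link.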
However, the basic version of the algorithm is based on the fact that the importance of a web page is determined by importance of other pages linking to it. It is also true that a web page imparts importance to other pages linked to it. As a search engine, Google should be able to accomplish two major tasks (Yan, n.d.):
- By crawling the web and indexing the data it finds, Google should be able to access and remember all websites.
- It should be able to determine the order in which pages are displayed for any search query.
A large number of web pages will always match the specified text for any search, but not every match is relevant to the intent of the search; hence the need for Google to know how important a particular web page is. This helps in determining the most relevant web pages and displaying them as the first results (Yan, n.d.). This task is carried out by the PageRank algorithm, which evaluates each web page and its importance in relation to other web pages.
In this case, a directed graph can be associated with the World Wide Web, where each web page is represented by a node and each arc from node i to node j represents a link from page i to page j. The rank of a page depends on the ranks of all the pages linking to it, where each rank is divided by the number of out-links of the linking page.
In the simplest version, however, the importance of any page is equal to the sum of the importance values of the pages linking to it. One concern with this version is that a page with many outbound links has more influence than one with fewer links (Yan, n.d.). For instance, if a web page with an importance of 2 links to 20 other pages, it imparts a total importance of 40 across those pages.
By contrast, if a page with an importance of 10 links to only one page, it imparts an importance of just 10, despite its higher importance value. There is therefore a need to limit the total impact any page can have.
It is possible to solve the problem by changing the ranking value added to a page by fractionalizing its total ranking value. For example, if a page P with importance of 2 links to a total of n pages, the importance it imparts to each of the linked pages can be expressed as 2/n.
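The effect of this change can be checked with a few lines of arithmetic. The sketch below uses the numbers from the example above; the helper name `share_per_link` is hypothetical and introduced only for illustration.

```python
from fractions import Fraction


def share_per_link(importance, outbound_links):
    """Importance a page passes to each page it links to: importance / n."""
    if outbound_links <= 0:
        raise ValueError("page must have at least one outbound link")
    return Fraction(importance, outbound_links)


# A page with importance 2 and 20 outbound links (numbers from the text).
per_link = share_per_link(2, 20)
total_given = per_link * 20  # total importance handed out by this page

# A page with importance 10 and a single outbound link.
single_link = share_per_link(10, 1)

print(per_link, total_given, single_link)
```

With the fractional rule, the page with 20 links hands out no more than its own importance in total, so the page with importance 10 is no longer outweighed.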
With this change, the way the importance value is determined stays the same, and the importance of networked pages can be represented in matrix form (Yan, n.d.). To illustrate the algorithm, a theoretical network of five web pages is used (Batra & Sharma, 2013). A graph is created for the network, in which directed edges link the pages together.
Page 1 links to pages 2 and 3, page 2 links to pages 3 and 5, page 3 links to page 2, page 4 links to page 1, and, finally, page 5 links to pages 1, 3, and 4. The network can be represented as a matrix, assuming that every page has an importance value of 1 (Batra & Sharma, 2013).
In the matrix, each column represents the outgoing links of a web page, while each row represents the incoming links of a web page. From this illustration, we can see that the importance of a web page is the sum of the entries in its row (Batra & Sharma, 2013). Using the example above, page 1 will have an importance value of 1·x4 + (1/3)·x5, where x4 and x5 represent the importance values for pages 4 and 5, respectively.
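With every importance value set to 1, the row sums give the total importance values for the five pages: 4/3 for page 1, 3/2 for page 2, 4/3 for page 3, 1/3 for page 4, and 1/2 for page 5. Thus, the most important page is page 2, with an importance value of 3/2. Therefore, search results will be returned in the order 2 > 1 = 3 > 5 > 4.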
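The five-page network can be written out as a matrix in a few lines. This is a minimal sketch that assumes the link structure exactly as listed in the text and a single pass of the row sums. The names `links`, `matrix`, and `importance` are illustrative, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Outbound links of the five-page network described in the text.
links = {1: [2, 3], 2: [3, 5], 3: [2], 4: [1], 5: [1, 3, 4]}
pages = sorted(links)

# matrix[i][j] is the share that page j+1 passes to page i+1:
# 1 / (number of outbound links of page j+1) if it links there, otherwise 0.
matrix = [[Fraction(1, len(links[src])) if dst in links[src] else Fraction(0)
           for src in pages]
          for dst in pages]


def importance(values):
    """One pass: each page's importance is the weighted sum along its row."""
    return [sum(w * x for w, x in zip(row, values)) for row in matrix]


start = [Fraction(1)] * len(pages)  # every page starts with importance 1
first_pass = importance(start)

for page, value in zip(pages, first_pass):
    print(f"page {page}: {value}")
```

The first row of `matrix` holds 1 in the column for page 4 and 1/3 in the column for page 5, which is the expression 1·x4 + (1/3)·x5 for page 1. Each column sums to 1, so no page hands out more importance than it has.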
Comparison between PageRank, TrustRank, and HITS Algorithms
PageRank, TrustRank, and HITS are some of the major algorithms discussed on search engine optimization forums. PageRank is a numerical value in the range of 1 to 10 assigned to any single web page (Batra & Sharma, 2013), TrustRank is the level of trust granted to any single website, and HITS is an iterative algorithm that scores documents on the web by their links (Batra & Sharma, 2013).
However, unlike the PageRank algorithm, which is executed at indexing time, the HITS algorithm is executed at query time. PageRank has been a popular algorithm in the Google search engine, and TrustRank can be bolted onto PageRank for better search results.
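To make the contrast with PageRank concrete, the following is a rough sketch of the basic hub and authority iteration behind HITS. It is a simplified illustration rather than a production implementation: the function name `hits`, the fixed iteration count, and the five-page example graph are assumptions, and the step of selecting a query-specific subgraph is left out.

```python
import math


def hits(links, iterations=50):
    """Basic hub/authority iteration on a link graph.

    `links` maps each page to the pages it links to. HITS is normally run on
    the subgraph of pages related to a query; here the graph is passed in directly.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


graph = {1: [2, 3], 2: [3, 5], 3: [2], 4: [1], 5: [1, 3, 4]}
hub_scores, authority_scores = hits(graph)
print(max(authority_scores, key=authority_scores.get))
print(max(hub_scores, key=hub_scores.get))
```

Unlike PageRank, which assigns one score per page, this iteration keeps two scores per page: one for how good a page is as a source of links (hub) and one for how well it is linked to (authority).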
Primary Uses of the Algorithm
The most important use of the PageRank algorithm is in search engines, as it was specifically developed for the Google search engine. The PageRank algorithm helps to rank websites in order to provide faster and more relevant results. However, the concept can be used in a vast number of other systems, since its core is to track the relationships between linked pages.
This helps to know which nodes are linked the most and are more important to the overall system. Therefore, the PageRank algorithm can be used in any system, in order to determine the most important sections (Yan, n.d.).
This algorithm can also be used to determine key species in ecology. It is used to map the relationships between species in an ecosystem and to identify the most important species. This helps in assigning importance to key plant and animal species in an ecosystem, which makes it easier to forecast the consequences of events such as the removal or extinction of species from the ecosystem (Yan, n.d.).
The PageRank algorithm is also used in textual analysis for word-sense disambiguation. Word-sense disambiguation refers to the process of determining the exact meaning of a word based on its particular context, since words can have several meanings (Yan, n.d.). Here, the algorithm is applied to graphs extracted from natural-language documents, which helps in identifying the most likely sense of each word.
The algorithm can also be used to determine false positives in an alarm system. One of the major concerns in security systems is the overwhelming number of false positives, which are usually caused by user error or other non-dangerous factors. Applying the PageRank algorithm to graphs of activated alarms allows users to isolate locations that have known attackers and thus determine the effects on other locations (Yan, n.d.). This, in turn, provides a higher probability of identifying true positives.
The PageRank algorithm can also be used in computer forensics, which involves obtaining and identifying evidence as well as analyzing the collected evidence. This analysis includes using a series of keywords to reach the most important information and the right file logs, and exploring the links between different lines of evidence. The analysis can be done using the PageRank algorithm. The algorithm can efficiently catalog the electronic evidence and identify the relationships between pieces of evidence in each category.
Another use of the PageRank algorithm is in searching networks outside the internet (Yan, n.d.). For example, it can be applied to academic writing by substituting citations for links. The PageRank algorithm can then identify the most referenced and most influential academic papers.
However, this application of the algorithm can be abused by authors who add irrelevant citations to their papers. Because such citations inflate the PageRank value of the cited research papers, the approach is best applied to smaller networks.
Future Consideration of the Algorithm
The standard PageRank algorithm has been extended to the weighted PageRank, proposed by Wenpu Xing and Ali Ghorbani, due to the tendency of users not to follow direct links to required pages (Tuteja, 2013). The algorithm is based on the concept that an important page is one that many other web pages link to. Rather than being divided evenly, the rank of a page is shared among the pages it links to, in proportion to their popularity. This modification is based on the fact that not all users follow direct links on the World Wide Web.
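One way to sketch the weighted variant is to replace the even split 1/C(T) with a weight based on the in-link and out-link counts of the target page, relative to the other pages the source links to. The code below follows the commonly cited form of this update and should be read as an illustration under that assumption, not as the authors' reference implementation. The names `link_weight` and `weighted_pagerank`, the default d = 0.85, and the example graph are hypothetical.

```python
def link_weight(links, v, u):
    """Share of v's rank passed to u, from u's in-link and out-link counts
    relative to all pages that v links to."""
    in_count = {p: sum(p in targets for targets in links.values()) for p in links}
    out_count = {p: len(links[p]) for p in links}
    w_in = in_count[u] / sum(in_count[p] for p in links[v])
    w_out = out_count[u] / sum(out_count[p] for p in links[v])
    return w_in * w_out


def weighted_pagerank(links, d=0.85, iterations=50):
    """Iterate rank(u) = (1 - d) + d * sum(rank(v) * weight(v, u)) over in-links.

    Every page is assumed to have at least one outbound link.
    """
    pages = list(links)
    # Link weights depend only on the graph, so compute them once.
    weights = {(v, u): link_weight(links, v, u) for v in pages for u in links[v]}
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        rank = {u: (1 - d) + d * sum(rank[v] * weights[(v, u)]
                                     for v in pages if u in links[v])
                for u in pages}
    return rank


graph = {1: [2, 3], 2: [3, 5], 3: [2], 4: [1], 5: [1, 3, 4]}
weighted_ranks = weighted_pagerank(graph)
```

In this sketch, page 3 links only to page 2, so page 2 receives the whole of page 3's share, while page 1 splits its share unevenly between pages 2 and 3 according to their link counts.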
However, due to the large amount of information on the web, users need more time to reach relevant pages. In this regard, an algorithm has been proposed that uses the number of times a link has been visited to determine the rank of a page (Tuteja, 2013).
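The idea of using link visits can be illustrated by splitting a page's rank among its outbound links in proportion to how often each link was followed. The sketch below is a loose illustration of that idea only. It does not reproduce the exact formula of the proposed algorithm, and the function name `visit_based_rank`, the visit counts, and the page names are invented for the example.

```python
def visit_based_rank(visits, d=0.85, iterations=50):
    """Rank pages when each link's share follows its visit count.

    `visits[v][u]` is the (hypothetical) number of recorded visits of link v -> u.
    """
    pages = list(visits)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for u in pages:
            total = 0.0
            for v in pages:
                clicks = visits[v].get(u, 0)
                all_clicks = sum(visits[v].values())
                if clicks and all_clicks:
                    # v passes on its rank in proportion to how often v -> u was followed.
                    total += rank[v] * clicks / all_clicks
            new_rank[u] = (1 - d) + d * total
        rank = new_rank
    return rank


# Hypothetical visit log: "home" links to both pages, but "docs" is followed far more often.
visit_log = {
    "home": {"docs": 90, "blog": 10},
    "docs": {"home": 50},
    "blog": {"home": 5},
}
ranks_by_visits = visit_based_rank(visit_log)
print(ranks_by_visits)
```

With an even split, "docs" and "blog" would receive the same share from "home"; weighting by visits lets the more frequently followed link carry more rank.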
Because users add to the rank of pages simply by visiting links, without having to do anything else productive, the proposed algorithm may help users reach the required information faster. However, some of the future considerations for the proposed algorithm are (Tuteja, 2013):
- The need for a web graph with a large number of hyperlinks and websites in order to check how important and accurate the method is.
- The need for some other measures, such as the recent use of the information about users, links and the time spent on web pages corresponding to the links.
Summary and Conclusion
Google's success is attributed to its PageRank algorithm, which allows it to return the required web pages in order of importance. It is worth noting that the exact algorithm currently used by Google is not publicly accessible and is complex in nature. The algorithm is constantly being updated due to the ever-changing nature of the internet. Google still uses the weighted link matrix to determine PageRank, even though it may be complex.
The concept of the algorithm can also be applied in various systems other than search engines, some of which have been discussed above. The key concept of the algorithm is to track links between nodes, which helps to rank nodes according to their importance. In general, the algorithm can be applied to any system that requires ranking the importance of elements within the system.