How crawl-based search engines work

Dec 6, 2012 00:00 · 852 words · 4 minute read SEO


This part of the research takes an investigative look at crawl-based search engines, focusing on how their core components function.

Crawling

A web crawler is a program that browses webpages automatically and methodically. It is also known as a “bot”, “spider”, “robot”, “ant” or “indexer”. Crawlers help search engines provide up-to-date information. Their main aim is to download pages and add them to the search engine’s index, but they sometimes perform other tasks such as HTML validation or gathering specific kinds of information, such as email addresses.

Crawlers start from a list of URLs known as seeds and then discover new webpages through the hyperlinks and sitemaps found in those seed pages (Google, Google Basics, 2011). A study in 2005 showed that most search engines could cover only 40-70% of the indexable web, compared with roughly 16% in 1999. This highlights the importance of downloading high-quality pages rather than just any HTML page.
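As a rough, hedged illustration of this seed-and-follow process (the seed URL is a placeholder and the code omits real-world concerns such as robots.txt handling, retries and politeness delays), a minimal breadth-first crawler in Python might look like this:

```python
# Minimal breadth-first crawler sketch: start from seed URLs and
# follow discovered hyperlinks. Standard library only; illustrative.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, max_pages=50):
    """Download pages starting from the seed list, breadth first."""
    frontier = deque(seeds)          # URLs waiting to be fetched
    seen = set(seeds)                # avoid re-queuing duplicates
    pages = {}                       # url -> raw HTML (the downloaded set)

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                 # skip unreachable pages
        pages[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages


if __name__ == "__main__":
    downloaded = crawl(["https://example.com/"])   # hypothetical seed
    print(len(downloaded), "pages fetched")
```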

Strategy

A web crawler uses special algorithms and methods to avoid downloading duplicate content and to prioritise what it downloads so that it works efficiently (Edward et al.). Crawlers follow a set of policies: a “selection policy” that indicates which pages should be downloaded, a “re-visit policy” that states when to check for updates, a “politeness policy” to avoid overloading web servers, and a “parallelization policy” that coordinates distributed crawlers.
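To make these policies more concrete, the sketch below combines an assumed page-importance score (the selection policy) with a fixed per-host delay (the politeness policy); both the score and the two-second delay are illustrative assumptions, not parameters of any real crawler:

```python
# Sketch of a crawl frontier combining a selection policy (priority
# by an assumed page-importance score) with a politeness policy
# (minimum delay per host). Illustrative only.
import heapq
import time
from urllib.parse import urlparse

POLITENESS_DELAY = 2.0        # assumed seconds between hits to one host


class Frontier:
    def __init__(self):
        self._heap = []                    # entries are (-score, url)
        self._last_hit = {}                # host -> time of last fetch

    def add(self, url, score):
        """Selection policy: more important pages come out first."""
        heapq.heappush(self._heap, (-score, url))

    def next_url(self):
        """Politeness policy: wait if the host was fetched too recently."""
        if not self._heap:
            return None
        neg_score, url = heapq.heappop(self._heap)
        host = urlparse(url).netloc
        elapsed = time.time() - self._last_hit.get(host, 0.0)
        if elapsed < POLITENESS_DELAY:
            time.sleep(POLITENESS_DELAY - elapsed)
        self._last_hit[host] = time.time()
        return url


if __name__ == "__main__":
    frontier = Frontier()
    frontier.add("https://example.com/about", score=0.2)   # hypothetical URLs
    frontier.add("https://example.com/", score=0.9)
    print(frontier.next_url())   # the higher-scored homepage comes out first
    print(frontier.next_url())   # then /about, after a polite delay
```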

Architecture

A good crawler should have a highly optimised architecture that enables it to download millions of pages each week. Web crawlers are the heart of every search engine, yet few public details exist about how they work, because search engines are concerned about web spammers who could damage the quality of their results (Shashi, Rohit, & Karm Veer, 2010).

Index

In the next step, the search engine analyses the information on each website, considering each page’s title, content, meta tags and so on. It then indexes the pages in its database so that users can find relevant information as quickly as possible. The major search engines try to index the full text of each page and to associate multimedia content such as video, audio and images with it, while others restrict their indexes to partial text and rely on other services to provide real-time results. Indexers care about factors such as “index size”, “lookup speed”, “maintenance” and “storage type”. Search engines use different methods of data indexing. For instance, the “suffix tree” is a tree-like structure that supports linear-time lookup; although it has many advantages, it often needs a great deal of storage space. The suffix array is an alternative method that requires less space and memory.
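As a minimal sketch of the suffix-array idea (a naive construction chosen for clarity, not the compressed structures a real indexer would use), the example below stores only the sorted suffix start positions and still answers substring lookups by binary search:

```python
# Naive suffix array sketch: store sorted suffix start positions and
# answer "does this text contain the pattern?" by binary search.


def build_suffix_array(text):
    """Return suffix start positions sorted by the suffixes themselves."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def contains(text, suffix_array, pattern):
    """Binary-search the sorted suffixes for one starting with pattern."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(suffix_array) and text[suffix_array[lo]:].startswith(pattern)


if __name__ == "__main__":
    text = "search engines index the indexable web"
    sa = build_suffix_array(text)
    print(contains(text, sa, "indexable"))   # True
    print(contains(text, sa, "crawler"))     # False
```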

Search engines also benefit from other methods, such as the “inverted index” (Black, 2008), which stores the occurrences of each search term (for example in a binary tree), the “citation index”, which analyses the connections between hyperlinks, and the “n-gram index”, which focuses on sequences of data (Foster, 1965).
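The inverted-index idea can be sketched as follows; a plain Python dictionary stands in here for whatever on-disk structure, such as a binary tree, a production engine would actually use:

```python
# Inverted index sketch: map each term to the set of documents that
# contain it, then answer a multi-term query by intersecting postings.
from collections import defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return documents containing every term of the query (AND semantics)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


if __name__ == "__main__":
    docs = {                                   # hypothetical toy corpus
        1: "web crawlers download pages",
        2: "search engines index web pages",
        3: "users type a search query",
    }
    index = build_inverted_index(docs)
    print(search(index, "web pages"))          # {1, 2}
    print(search(index, "search query"))       # {3}
```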

Serving

When a user types a query, the search engine applies its criteria to find the most relevant information and show it as a list of short summaries. The usefulness of a search engine depends heavily on this step, because millions of webpages may contain the user’s query terms without being relevant.
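One possible way of deciding what “most relevant” means is sketched below using a simple TF-IDF score; this is an illustrative assumption rather than a description of any particular engine’s ranking criteria:

```python
# TF-IDF ranking sketch: score each document for a query and return
# results in decreasing order of relevance. Illustrative only.
import math
from collections import Counter


def rank(docs, query):
    """docs: {doc_id: text}. Returns [(doc_id, score)] sorted by score."""
    n = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                idf = math.log(n / df[term])        # rarer terms weigh more
                score += (tf[term] / len(tokens)) * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    docs = {                                        # hypothetical toy corpus
        1: "buy a used car online",
        2: "car reviews and car prices",
        3: "how search engines rank pages",
    }
    print(rank(docs, "car prices"))   # doc 2 should rank above doc 1
```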

There are usually four types of queries supported by search engines:

  • Informational queries, which refer to broad topics.
  • Navigational requests, which target a particular website, e.g. ebay.com or youtube.com.
  • Transactional queries, in which users try to perform an action, e.g. buying a car.
  • Connectivity requests, which look for the connectivity of pages, e.g. which websites link to a particular webpage (Manning, Raghavan, & Schütze, 2009) (Enge, Spencer, Fishkin, & Stricchiola, 2010, p. 5).

Although most search queries are informational, in some cases it is difficult to determine a query’s type. For instance, according to a report by the Queensland and Pennsylvania State universities, 80% of requests are informational, while IBM research puts the figure at only about 50% (Jansen, Booth, & Spink, 2008) (Broder, 2002).

References

  • Black, P. (2008). Inverted index. Retrieved January 8, 2012, from National Institute of Standards and Technology: [1]
  • Broder, A. (2002). A taxonomy of web search. ACM Digital Library, 36(2), 3-10.
  • Enge, E., Spencer, S., Fishkin, R., & Stricchiola, J. (2010). The Art of SEO (1st ed.). (M. Treseler, Ed.) Sebastopol: O’Reilly.
  • Foster. (1965). Information retrieval: information storage and retrieval using AVL trees. ACM Digital Library, 192-205.
  • Grappone, J., & Couzin, G. (2011). Search Engine Optimization: An Hour a Day (3rd ed.). (P. Gaughan, Ed.) Indianapolis: Wiley.
  • Jansen, B., Booth, D., & Spink, A. (2008). Determining the informational, navigational, and transactional intent of Web queries. ScienceDirect, 44(3), 1251-1266.
  • Manning, C., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval. Retrieved January 8, 2012, from Stanford University: [2]
  • Shashi, S., Rohit, A., & Karm Veer, A. (2010). Advances in Computer Engineering (ACE), 2010 International Conference on (pp. 29-33). IEEE.
  • Wayne, K., & Sedgewick, R. (2011, January 10). Balanced search tree. Retrieved January 14, 2012, from Algorithms: [3]