Google Crawler

Definition

A Web Crawler is a computer program that browses the World Wide Web in a methodical, automated manner.

What Crawlers Are

  • Crawlers are computer programs that roam the Web in order to automate specific Web-related tasks.
  • The role of crawlers is to collect web content.

Beginning

  • A key motivation for designing Web Crawlers has been to retrieve Web pages and add their representations to a local repository.
  • A Google crawler starts with a list of URLs to visit, called the seeds. As it visits these URLs, it identifies all the hyperlinks on each page and adds them to the list of URLs still to visit, called the crawl frontier. URLs from the frontier are visited recursively according to a set of policies.

How Does Google Crawler Work

  • It starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks on each page and adds them to the list of URLs to visit, called the crawl frontier.
  • URLs from the frontier are visited recursively according to a set of policies.
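
The seed/frontier bookkeeping described above can be sketched with a queue and a visited set. This is a minimal illustration, not Google's implementation: the seed URLs and the allowed_by_policy check are assumptions standing in for real crawl policies, and fetch_and_extract_links is a hypothetical helper (a working fetch loop is shown under "The Basic Algorithm" below).

    from collections import deque

    # Illustrative sketch: seeds feed a crawl frontier, and visited URLs are
    # remembered so each page is fetched at most once.
    seeds = ["https://example.com/", "https://example.org/"]   # assumed seed URLs

    frontier = deque(seeds)   # the "crawl frontier": URLs waiting to be visited
    visited = set()           # URLs that have already been visited

    def allowed_by_policy(url):
        # Stand-in for real visiting policies (robots.txt, crawl depth, domain filters).
        return url.startswith("https://")

    def fetch_and_extract_links(url):
        # Hypothetical helper; a working version appears in "The Basic Algorithm".
        return []

    while frontier:
        url = frontier.popleft()
        if url in visited or not allowed_by_policy(url):
            continue
        visited.add(url)
        for link in fetch_and_extract_links(url):
            if link not in visited:
                frontier.append(link)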

The Basic Algorithm

{
    Pick up the next URL
    Connect to the server
    GET the URL
    When the page arrives, get its links
    (optionally do other stuff)
    REPEAT
}
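
The loop above can be translated into a small, runnable Python sketch using only the standard library. It is an illustration of the basic algorithm, not Googlebot's actual code; the seed URL and the page limit are assumptions chosen for the example.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        # Collects href values from <a> tags while the page is parsed.
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=10):
        frontier = deque(seeds)
        visited = set()
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()              # Pick up the next URL
            if url in visited:
                continue
            try:
                # Connect to the server and GET the URL
                with urlopen(url, timeout=10) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except OSError:
                continue                          # unreachable or broken URL: skip it
            visited.add(url)
            parser = LinkExtractor()
            parser.feed(html)                     # When the page arrives, get its links
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute not in visited:
                    frontier.append(absolute)     # ...and REPEAT from the frontier
        return visited

    # Example run (illustrative seed; any small site works):
    # print(crawl(["https://example.com/"]))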

  • Search Engine Marketing: SEM covers everything a company can do to advertise itself on a search engine, including paid inclusion and other ads.
  • Search Engine Optimization: SEO is the process of improving the visibility of a website or a webpage in search engines via the "natural" or unpaid ("organic") search results.
  • The name of Google's Web Crawler is Googlebot (a spider).
  • Googlebot runs on a network of powerful computers that work together, visiting web servers and requesting thousands of pages at a time (a single-machine sketch of this kind of parallel fetching follows this list).
  • 1998: Googlebot is created by S. Brin and L. Page.
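
Googlebot's distributed architecture is not public, but the idea of having many requests in flight at once can be illustrated on a single machine with a thread pool. The worker count and URLs below are assumptions for the example; a real crawler spreads this work across many machines.

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    def fetch(url):
        # Fetch one page and report its size; errors are returned rather than raised.
        try:
            with urlopen(url, timeout=10) as response:
                return url, len(response.read())
        except OSError as exc:
            return url, exc

    urls = ["https://example.com/", "https://example.org/", "https://example.net/"]

    # Several fetches are in flight at the same time.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for url, result in pool.map(fetch, urls):
            print(url, result)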

Examples

  • Yahoo! Slurp: Yahoo Search crawler.
  • Msnbot: Microsoft's Bing Web Crawler.
  • Googlebot: Google's Web Crawler.
  • WebCrawler: Used to build the first publicly-available full-text index of a subset of the Web.
  • World Wide Web Worm: Used to build a simple index of document titles and URLs.
  • Web Fountain: Distributed, modular crawler written in C++.
  • Slug: Semantic Web Crawler.

Types of Googlebot

  • Deepbot: Visits all the pages it can find on the web by harvesting every link it discovers and following it. It currently takes about a month to perform this deep crawl.
  • Freshbot: Keeps the index fresh by visiting frequently changing sites at shorter intervals. The rate at which a website is updated dictates how often Freshbot visits it (a simple revisit-scheduling heuristic along these lines is sketched after this list).
  • A web crawler is the process or program used by search engines to download pages from the web for later processing by a search engine, which indexes the downloaded pages to provide fast searches.
  • It is a program or automated script that browses the World Wide Web in a methodical, automated manner; web crawlers are also known as web spiders and web robots.
  • Less-used names: ants, bots and worms.
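
Freshbot's real scheduling policy is not public. The idea that a page's update rate dictates its revisit frequency can be illustrated with a simple heuristic; the function name and the interval bounds below are assumptions for the example only.

    from datetime import datetime, timedelta

    def next_visit(last_visit, observed_change_interval):
        # Revisit roughly as often as the page has been observed to change,
        # clamped between once an hour and once a month. A toy heuristic only.
        interval = min(max(observed_change_interval, timedelta(hours=1)), timedelta(days=30))
        return last_visit + interval

    now = datetime.now()
    print(next_visit(now, timedelta(minutes=30)))   # fast-changing page: revisit in an hour
    print(next_visit(now, timedelta(days=90)))      # rarely changing page: revisit in a month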

Conclusion

  • Web Crawlers are an important aspect of search engines.
  • High-performance web crawling processes are basic components of many Web services.
It is not a trivial matter to set up such systems:
  1. The data manipulated by these crawlers covers a very wide area of the Web.
  2. It is crucial to preserve a good balance between random-access memory usage and disk accesses (one simple way to do this is sketched below).
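
The memory/disk balance can be illustrated with a toy frontier that keeps a bounded number of URLs in RAM and spills the overflow to a file. The class name, the in-memory limit, and the overflow file are all illustrative assumptions; production crawlers use purpose-built on-disk queues.

    from collections import deque
    import os
    import tempfile

    class SpillingFrontier:
        # Toy illustration of the RAM/disk trade-off: keep a bounded number of
        # frontier URLs in memory and append the overflow to a disk file.
        def __init__(self, max_in_memory=100_000):
            self.memory = deque()
            self.max_in_memory = max_in_memory
            self.spill_path = os.path.join(tempfile.gettempdir(), "frontier_overflow.txt")
            open(self.spill_path, "w").close()   # start with an empty overflow file

        def push(self, url):
            if len(self.memory) < self.max_in_memory:
                self.memory.append(url)
            else:
                with open(self.spill_path, "a") as overflow:
                    overflow.write(url + "\n")

        def pop(self):
            if not self.memory:
                self._refill_from_disk()
            return self.memory.popleft() if self.memory else None

        def _refill_from_disk(self):
            # Pull spilled URLs back into memory in one batch to amortize disk access.
            with open(self.spill_path) as overflow:
                lines = overflow.read().splitlines()
            self.memory.extend(lines[:self.max_in_memory])
            with open(self.spill_path, "w") as overflow:
                overflow.write("".join(line + "\n" for line in lines[self.max_in_memory:]))

    # Example: with a tiny in-memory limit, the third URL is spilled to disk.
    frontier = SpillingFrontier(max_in_memory=2)
    for u in ["https://a.example/", "https://b.example/", "https://c.example/"]:
        frontier.push(u)
    print(frontier.pop(), frontier.pop(), frontier.pop())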
