
Nishu Jain
Feb 15, 2022


Web Crawlers: Spies of the Internet

To find information on the hundreds of millions of webpages that exist on the Internet, a search engine employs special programs called web crawlers, or spiders, which gather information in an automated and orderly way for indexing.

Legitimate operators, most notably search engines like Google, crawl the web as a means of providing up-to-date search results.

Usually, a web crawler takes its instructions from the robots.txt file at the root of a website. It then recursively crawls the site's tree-like link structure, indexing content as it goes.
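
To make that loop concrete, here is a minimal sketch in Python. It is not how any real search engine's crawler works; it only illustrates the pattern: read robots.txt using the standard library's urllib.robotparser, then walk the site's link tree, skipping disallowed pages. The start URL https://example.com and the page limit are placeholders.

```python
# A minimal sketch of a polite crawler: it reads robots.txt first,
# then follows in-domain links to index pages. https://example.com
# is a placeholder start URL.
import urllib.robotparser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    root = f"{urlparse(start_url).scheme}://{urlparse(start_url).netloc}"
    rules = urllib.robotparser.RobotFileParser(root + "/robots.txt")
    rules.read()  # fetch the site's crawling instructions

    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen or not rules.can_fetch("*", url):
            continue  # skip already-visited or disallowed pages
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="ignore")
        except OSError:
            continue  # unreachable page; move on
        print("Indexed:", url)  # a real crawler would store the content here
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == urlparse(start_url).netloc:
                queue.append(absolute)  # stay within the site's own tree

crawl("https://example.com")
```

A production crawler adds much more on top of this skeleton, such as rate limiting, retries, and deduplication of near-identical pages, but the fetch-parse-follow cycle stays the same.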

Crawling consumes resources on the systems being visited, and often without permission. So mechanisms exist for websites that do not wish to be crawled to make this known to the crawling agent, although opting out in this way compromises SEO, since pages that are never crawled never appear in search results.
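
For illustration, an opt-out might look like the following hypothetical robots.txt; the /private/ path and the BadBot user agent are made-up examples:

```
# Keep all crawlers out of /private/ but leave the rest crawlable.
User-agent: *
Disallow: /private/

# Block one misbehaving crawler from the entire site.
User-agent: BadBot
Disallow: /
```

Note that robots.txt is purely advisory: well-behaved crawlers honor it, but nothing technically prevents a rogue one from ignoring it.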

Blogs like Medium are crawled more actively because they offer fresh content and are updated several times a day.

