Google Now Open Sources Its Robots.txt Parser Code

The Robots Exclusion Protocol (REP), better known as robots.txt, is not owned by Google or any other company; it has been a de facto standard for about 25 years. Millions of sites use a robots.txt file to tell search engine crawlers, including Google's, which parts of their content should be crawled and which should not.

For 25 years, the REP was never formally standardized, which led to differing interpretations and frustrating consequences. Google has now officially announced that it wants to make the Robots Exclusion Protocol an Internet standard and, as part of that effort, has open-sourced its robots.txt parser, a C++ library created about 20 years ago. The company's production systems use this library to parse and match rules in robots.txt files. The code is available on GitHub.

If Google's bots crawl your site and it has no robots.txt file, they treat all of its content as open to crawling. With no robots.txt file, nothing tells the crawlers which parts of the site should be crawled or indexed and which parts should be ignored.
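To make this behavior concrete, here is a minimal sketch that uses Python's standard `urllib.robotparser` module rather than Google's C++ library. The helper name `is_allowed`, the `ExampleBot` user agent, and the example.com URLs are made up for illustration, and Python's parser may differ from Google's in edge cases.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rule set: block everything under /private/ for all crawlers.
RULES = """User-agent: *
Disallow: /private/
"""

# An empty rule set has no restrictions, so every path is allowed.
print(is_allowed("", "ExampleBot", "https://example.com/private/page"))     # True
# With the rule in place, /private/ is blocked and other paths stay open.
print(is_allowed(RULES, "ExampleBot", "https://example.com/private/page"))  # False
print(is_allowed(RULES, "ExampleBot", "https://example.com/public/page"))   # True
```

The empty-string case stands in for a site whose robots.txt contains no rules; how a crawler reacts to a missing file or a server error is up to the crawler itself.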

Over the years, however, developers have interpreted the protocol in different ways, which has made it hard to settle on one standard set of rules. One example of the uncertainty is what to do when a text editor adds BOM characters to a robots.txt file. Crawler and tool developers are also unsure how to deal with robots.txt files that are hundreds of megabytes in size.
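One way a crawler might handle both cases is sketched below. This is only an illustration under stated assumptions, not Google's implementation: the function name `prepare_robots_text` is hypothetical, and the 500 KiB cap is an arbitrary value chosen for the example.

```python
import codecs

# Assumed size cap, for illustration only.
MAX_ROBOTS_BYTES = 500 * 1024


def prepare_robots_text(raw: bytes, max_bytes: int = MAX_ROBOTS_BYTES) -> str:
    """Normalize raw robots.txt bytes before handing them to a parser."""
    # Drop a leading UTF-8 byte order mark if a text editor added one.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # Keep only the first max_bytes bytes so a huge file cannot stall the crawler.
    raw = raw[:max_bytes]
    # Decode leniently; invalid or cut-off sequences become replacement characters.
    return raw.decode("utf-8", errors="replace")


cleaned = prepare_robots_text(codecs.BOM_UTF8 + b"User-agent: *\nDisallow: /tmp/\n")
print(cleaned.startswith("User-agent"))  # True
```

Truncating at a byte boundary can cut the last rule in half, so a real crawler would probably also discard any incomplete final line.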


These are the main reasons why Google wants the REP to become an official Internet standard with fixed rules for everyone. Google has documented how the REP should be used and has submitted that document to the Internet Engineering Task Force (IETF) as a proposal.
