
New Open Source Robots.Txt Projects

23 Sept 2020
Nirlep Patel
Blog

Google reached a milestone last year by open-sourcing its robots.txt parser. After nearly 25 years as a mere de facto standard, the Robots Exclusion Protocol was proposed as an actual internet standard and is what Google officially follows when crawling web pages. What does this mean? It means that Google open-sourced the C++ library it has used for the past two decades to parse and match the rules in robots.txt files, making robots.txt files easier to work with. Such is the importance of robots.txt that this year Google announced further developments concerning it.

Google announced the launch of two new open-source robots.txt projects:

  1. Robots.txt Specification Test
  2. Java robots.txt parser and matcher

Both projects were built by Google interns: Andreea Dutulescu, who created the robots.txt Specification Test, and Ian Dolzhanskii, who created the Java robots.txt parser and matcher.

Now, what exactly are these two new open-source projects, and how are they useful for Google? Before getting into that, let us first look at what a robots.txt file is and why it is used.

What is a robots.txt file?

The robots.txt file is part of the Robots Exclusion Standard, also called the Robots Exclusion Protocol: a standard form of communication that websites use to talk to web crawlers or web robots. Simply stated, the robots.txt file tells Googlebot, Google's web crawler, which pages or files it may crawl and which it should stay away from. It helps in managing the crawler traffic to your website.
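To make this concrete, here is a minimal robots.txt file; the paths and sitemap URL are made up for the example. It blocks all crawlers from a private directory, carves out one exception for Googlebot, and points crawlers at the sitemap:

    User-agent: *
    Disallow: /private/

    User-agent: Googlebot
    Allow: /private/press/
    Disallow: /private/

    Sitemap: https://www.example.com/sitemap.xml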

Now that you know what a robots.txt file does, let us look at the latest developments related to this file.

1. Robots.txt Specification Test

Developed by Google intern Andreea Dutulescu, this specification test is a testing framework for robots.txt parser developers. It is used to test whether a robots.txt parser follows the Robots Exclusion Protocol and, if it does, to what extent. The test Andreea created is useful because, until now, there has been no thorough way to assess the validity of a parser. With the help of this tool, developers can build robots.txt parsers that follow the protocol faithfully.
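To get a feel for what such a framework checks, a test case pairs a robots.txt body with URLs and the verdicts a compliant parser must return. The layout below is an illustrative sketch, not the project's actual test format:

    # robots.txt under test
    User-agent: *
    Disallow: /admin/
    Allow: /admin/help/

    # verdicts a compliant parser must produce
    https://example.com/admin/login      ->  DISALLOWED
    https://example.com/admin/help/faq   ->  ALLOWED
    https://example.com/index.html       ->  ALLOWED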

2. Java Robots.txt Parser and Matcher

Developed by Google intern Ian Dolzhanskii, this recently released tool is now Google's official Java port of the C++ robots.txt parser. Java is among the most popular programming languages to learn in 2020 and is one of the most extensively used languages at Google. The port reproduces the behaviour and functions of the C++ parser and has been thoroughly tested, and Google plans to use it in its production systems.
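The heart of any such parser is the rule-matching logic: under the Robots Exclusion Protocol, the matching rule with the longest path wins, and a tie goes to Allow. The sketch below is a simplified, self-contained Java illustration of that rule; the class and method names are hypothetical, and this is not the API of Google's port:

    // Simplified sketch of REP longest-match rule resolution.
    // Hypothetical names; not the API of Google's Java port.
    import java.util.List;

    public class SimpleRobotsMatcher {
        // One Allow or Disallow line from a robots.txt group.
        record Rule(boolean allow, String pathPrefix) {}

        // REP: the matching rule with the longest path wins;
        // on a tie, the least restrictive rule (Allow) wins.
        static boolean isAllowed(List<Rule> rules, String path) {
            int bestLength = -1;
            boolean allowed = true; // no matching rule: crawling is allowed
            for (Rule rule : rules) {
                int len = rule.pathPrefix().length();
                if (path.startsWith(rule.pathPrefix())
                        && (len > bestLength || (len == bestLength && rule.allow()))) {
                    bestLength = len;
                    allowed = rule.allow();
                }
            }
            return allowed;
        }

        public static void main(String[] args) {
            List<Rule> rules = List.of(
                    new Rule(false, "/private/"),       // Disallow: /private/
                    new Rule(true, "/private/press/")); // Allow: /private/press/
            System.out.println(isAllowed(rules, "/private/data"));     // false
            System.out.println(isAllowed(rules, "/private/press/q3")); // true
            System.out.println(isAllowed(rules, "/blog/post"));        // true
        }
    }

A real parser additionally handles wildcards (* and $), user-agent group selection, and malformed input, which is exactly the kind of behaviour the specification test above exercises.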

Google aims to simplify web developers' jobs. Last year, it open-sourced the C++ library its production systems use for parsing and matching rules in robots.txt files and included a testing tool with the package.

This year, with the introduction of these two new open-source robots.txt projects, Google plans to make it easier to test how faithfully parsers handle these files, which in turn should make for a better web crawling experience.