Sunday, January 4, 2015

Please Don't Use Robots.txt

Webmasters can place a file called "robots.txt" in the root of their website to ask automated web crawlers not to access it. The rules can be specified per user agent and per path. See Wikipedia's robots.txt for an example.
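As a sketch of the format (the paths here are made up for illustration), a robots.txt is a list of groups, each naming a user agent and the paths it should stay out of:

    # Applies only to Google's crawler
    User-agent: Googlebot
    Disallow: /drafts/

    # Applies to every other crawler
    User-agent: *
    Disallow: /private/

A crawler is expected to obey the most specific group that matches its user agent and ignore the rest.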

You may have also heard of the Internet Archive, a nonprofit organization dedicated to saving the Internet from link rot and content takedowns. Sadly, and to the chagrin of many users, a lot of robots.txt files block the Internet Archive's crawler. Worse, when the crawler notices that a site has put up a robots.txt that excludes it, the Archive retroactively denies public access to its existing archives of that site.

So, if a site exists for many years, is archived by the Wayback Machine, goes out of business or otherwise loses its domain name and content, and is then replaced by some squatter site with a blocking robots.txt, all that real content is gone forever. (The Archive doesn't delete it from its servers, but it does deny public access.)

It would be entirely possible for the Archive to make its crawler ignore robots.txt (and many people want it to) - nothing short of an IP block at the routing level can force a client not to read a website - but doing so would likely cause outrage among webmasters.

Therefore, I would like to ask every webmaster to allow "ia_archiver" and "alexa" to crawl their sites. The archiver does not hit pages at high speed; it is generally a well-behaved robot. Nothing lasts forever, but it is very likely that the Internet Archive has better technological infrastructure than you. By allowing your content to be archived, you preserve humanity's knowledge for the future.
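Concretely, a robots.txt along these lines would grant the archiver full access while keeping whatever restrictions you want for other crawlers (the /private/ path is again just an illustration):

    # Give the Internet Archive's crawler full access
    # (an empty Disallow permits everything)
    User-agent: ia_archiver
    Disallow:

    # Restrictions for all other crawlers
    User-agent: *
    Disallow: /private/

Because crawlers follow only the most specific matching group, ia_archiver reads its own section and ignores the catch-all rules.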
