Things You Might Not Know About robots.txt

While bringing one of my old sites up to spec, I realized that I’d never actually looked into robots.txt beyond copy-pasting ready-made directives.

So, without further ado, here’s a list of everything I could find about it which isn’t obvious if you, too, learned about it in an ad-hoc fashion:

  1. Crawlers will obey only the directives from the most specific User-agent group which matches; groups don’t cascade, so a specific group must repeat anything it needs from more general ones.
  2. Paths must start with a leading slash or they’ll be ignored.
  3. Longer paths (defined by character count) win over shorter paths when both Disallow and Allow match.
  4. Allow is younger and less likely to be supported by crawlers than Disallow.
  5. Crawlers will compare against the percent-encoded form of URLs when checking rules.
  6. Match patterns aren’t regexes, substring matches, literal matches, or globs… they’re literal prefix matches augmented with support for two metacharacters:
    1. The * to match any string of characters.
    2. The $ to match the end of the URL (It has no special meaning when not at the end of the pattern, so it can be escaped by using $* instead)
  7. robots.txt won’t prevent links from appearing in Google.
    1. Google will still show excluded pages if linked from allowed pages… the listings will just be bare URLs without page titles or excerpts.
    2. Pages covered by robots.txt can’t contribute their PageRank to your site.
    3. Bottom line: robots.txt is for controlling resource consumption. Use the HTML noindex/nofollow meta tags and X-Robots-Tag HTTP headers for hiding content for other reasons.
  8. Don’t exclude GoogleBot from your CSS and JavaScript. Google actually renders your pages in order to find more content than competitors, and blocking those resources can get you penalized under their “don’t show GoogleBot different content than real users” policy, since it can interfere with Google’s ability to retrieve AJAX-loaded content or detect a paywall.
  9. I shouldn’t have to say this, but robots.txt is advisory.
    1. Use it to keep crawlers out of pages like your shopping cart.
    2. Use it to prevent search engines from wasting their time in your spambot honeypots.
    3. Use it to keep search engines from walking a big tree of dynamically-generated filter pages which ultimately terminate at pages you’ve indexed in a more static fashion elsewhere in the site.
    4. Use it to opt out of aggressive prefetching extensions like Fasterfox.
    5. …just don’t think it has any benefits for security or secrecy.
  10. Historically, some search crawlers have been finicky, so be strict in your structure:
    1. Order your directives “User-agent, Disallow, Allow, Sitemap, Crawl-delay, Host”.
    2. Only put one path pattern per Disallow/Allow line.
    3. If you must use comments, they begin with #, but I advise against them.
    4. Avoid blank lines when feasible.
  11. The non-standard Host directive allows you to tell Yandex (which powers DuckDuckGo at the moment) that domains X, Y, and Z are mirrors, with X being the authoritative source.
  12. Google does not honour Crawl-delay. You need to set the crawl rate in Google Webmaster Tools instead.
  13. Use Google Search Console (formerly Google Webmaster Tools) to keep an eye out for robots.txt mistakes hiding pages you actually want crawled.
  14. Make sure your site replies with Error 400 if query parameters fail to parse.
    1. Google will sometimes generate query strings to try to tease out hidden content, as one of my sites discovered.
    2. On one of my sites, I have a query parameter that’s used to filter a listing of titles by their first character. (ie A-Z or a number, like a pocket phone directory)
    3. Despite it not being tied to a search field anywhere, GoogleBot concluded it was a search field and started spamming it with irrelevant crud.
    4. If GoogleBot receives Error 404 after it received a 200 OK for other values of the same query parameter, it apparently concludes that Error 404 means “No results. Try another.”
    5. Error 400 is the HTTP response for “malformed request”. It’s typically used for things like JSON APIs, but it applies equally well to “Validator expected a single alphanumeric character. Received a GoogleBot-generated query string.”
    6. Sending error 400 for any malformed URL causes GoogleBot to quickly learn to confine its guessing to actual search fields.
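The path-matching precedence described above (longest pattern wins, ties go to Allow, with `*` and `$` metacharacters) can be sketched in a few lines of Python. This is a simplified model of Google’s documented matching behaviour for illustration, not a full robots.txt parser:

```python
import re

def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a regex.
    '*' matches any run of characters; a trailing '$' anchors the end;
    everything else is a literal prefix match."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(rules, path: str) -> bool:
    """rules: list of ("allow" | "disallow", pattern) pairs.
    The longest matching pattern (by character count) wins; on a tie,
    Allow wins; if nothing matches, the path is allowed."""
    best_len, allowed = -1, True
    for verdict, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            n = len(pattern)
            if n > best_len or (n == best_len and verdict == "allow"):
                best_len, allowed = n, (verdict == "allow")
    return allowed

rules = [
    ("disallow", "/downloads/"),
    ("allow", "/downloads/free/"),
    ("disallow", "/*.pdf$"),
]
print(is_allowed(rules, "/downloads/free/app.zip"))  # True: longer Allow wins
print(is_allowed(rules, "/downloads/manual.pdf"))    # False
print(is_allowed(rules, "/blog/"))                   # True: no rule matches
```

Note how the trailing `$` is only special at the end of the pattern, which is why `$*` works as an escape: the `$` is no longer final, so it’s treated literally.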
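As a concrete version of the last point, here’s a minimal sketch of the validation logic; the `letter` parameter name is hypothetical, modeled on the first-character filter described above:

```python
from http import HTTPStatus

def filter_status(letter: str) -> int:
    """Status code for a hypothetical ?letter= index filter: a single
    alphanumeric character is well-formed; anything else is a malformed
    request (400), not a missing page (404), so GoogleBot doesn't treat
    the parameter as a search box that might yield more results."""
    if len(letter) == 1 and letter.isalnum():
        return HTTPStatus.OK           # 200: valid filter value
    return HTTPStatus.BAD_REQUEST      # 400: malformed request
```

The key design choice is distinguishing “this value could never be valid” (400) from “this value is valid but has no results” (404 or an empty 200 page).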

For more, the Robots.txt Tutorial is the best “from beginning to reference charts” introduction I found while bringing my knowledge up to date.

P.S. While not specifically a robots.txt thing, I learned that Google will honour an `hreflang` attribute on <link> tags and Link headers, and it’s always good to give GoogleBot more information with which to make informed crawling decisions.
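For example, the same alternate-language hint can be given either in the page head or as an HTTP header (the URLs and language codes here are illustrative placeholders):

```html
<!-- In the <head>: declare language alternates of this page. -->
<link rel="alternate" hreflang="en" href="https://example.com/en/page" />
<link rel="alternate" hreflang="de" href="https://example.com/de/page" />

<!-- Equivalent HTTP header form (useful for non-HTML resources like PDFs):
Link: <https://example.com/en/page>; rel="alternate"; hreflang="en" -->
```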

CC BY-SA 4.0 Things You Might Not Know About robots.txt by Stephan Sokolow is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

This entry was posted in Web Wandering & Opinion.
