Things You Might Not Know About robots.txt

While bringing one of my old sites up to spec, I realized that I’d never actually looked into robots.txt beyond copy-pasting ready-made directives.

So, without further ado, here’s a list of everything I could find about it which isn’t obvious if you also learned about it in an ad-hoc fashion:

  1. Crawlers will obey only the directives from the most specific User-agent rule which matches. (I assume this is because the Allow directive is much younger than the Disallow directive.)
  2. Paths must start with a leading slash or they’ll be ignored.
  3. Longer paths (defined by character count) win over shorter paths when both Disallow and Allow match.
  4. Allow is younger and less likely to be supported by crawlers than Disallow.
  5. Crawlers will compare against the percent-encoded form of URLs when checking rules.
  6. Match patterns aren’t regexes, substring matches, literal matches, or globs… they’re literal prefix matches augmented with support for two metacharacters:
    1. The * to match any string of characters.
    2. The $ to match the end of the URL (It has no special meaning when not at the end of the pattern, so it can be escaped by using $* instead)
  7. robots.txt won’t prevent links from appearing in Google.
    1. Google will still show excluded pages if linked from allowed pages… the listings will just be bare URLs without page titles or excerpts.
    2. Pages covered by robots.txt can’t contribute their PageRank to your site.
    3. Bottom line: robots.txt is for controlling resource consumption. Use the HTML noindex/nofollow meta tags and X-Robots-Tag HTTP headers for hiding content for other reasons.
  8. Don’t exclude GoogleBot from your CSS and JavaScript. Google actually renders your pages in order to find more content than its competitors, and blocking those files could interfere with its ability to retrieve AJAX-loaded content or detect a paywall, so you’ll be penalized under their “don’t show GoogleBot different content than real users” policy.
  9. I shouldn’t have to say this, but robots.txt is advisory.
    1. Use it to hide pages like your shopping cart page.
    2. Use it to prevent search engines from wasting their time in your spambot honeypots.
    3. Use it to keep search engines from walking a big tree of dynamically-generated filter pages which ultimately terminate at pages you’ve indexed in a more static fashion elsewhere in the site.
    4. Use it to opt out of aggressive prefetching extensions like Fasterfox.
    5. …just don’t think it has any benefits for security or secrecy.
  10. Historically, some search crawlers have been finicky, so be strict in your structure:
    1. Order your directives “User-agent, Disallow, Allow, Sitemap, Crawl-delay, Host”.
    2. Only put one path pattern per Disallow/Allow line.
    3. If you must, comments begin with # but I advise against them.
    4. Avoid blank lines when feasible.
  11. The non-standard Host directive allows you to tell Yandex.ru (which powers DuckDuckGo at the moment) that domains X, Y, and Z are mirrors, with X being the authoritative source.
  12. Google does not honour Crawl-delay. You need to set it in the Google Webmaster Tools.
  13. Use the Google Search Console in Google Webmaster Tools to keep an eye out for robots.txt mistakes hiding pages you actually want crawled.
  14. Make sure your site is replying with Error 400 if query parameters fail to parse.
    1. Google will sometimes generate search queries to try to tease out hidden content, as one of my sites discovered.
    2. On one of my sites, I have a query parameter that’s used to filter a listing of titles by their first character. (i.e. A-Z or a number, like a pocket phone directory)
    3. Despite it not being tied to a search field anywhere, GoogleBot concluded it was a search field and started spamming it with irrelevant crud.
    4. If GoogleBot receives Error 404 after it received a 200 OK for other values of the same query parameter, it apparently concludes that Error 404 means “No results. Try another.”
    5. Error 400 is the HTTP response for “malformed request”. It’s typically used for things like JSON APIs, but it applies equally well to “Validator expected a single alphanumeric character. Received a GoogleBot-generated query string.”
    6. Sending error 400 for any malformed URL causes GoogleBot to quickly learn to confine its guessing to actual search fields.
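
The matching rules from items 2, 3, and 6 can be sketched in a few lines of Python. This is only an illustration of those rules as described above, not the code any real crawler runs: the function names and sample rules are made up for the example, and what happens when an Allow and a Disallow of equal length both match is left unspecified here (this sketch simply keeps whichever came first).

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Patterns are literal prefix matches where * matches any string and a
    trailing $ anchors the end of the URL. A $ anywhere else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then turn each * into ".*"
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """Decide whether `path` may be crawled.

    `rules` is a list of (directive, pattern) pairs taken from a single
    User-agent group. The longest matching pattern wins; with no match,
    crawling is allowed.
    """
    best = None
    for directive, pattern in rules:
        if not pattern.startswith("/"):
            continue  # paths without a leading slash are ignored
        if pattern_to_regex(pattern).match(path):
            if best is None or len(pattern) > len(best[1]):
                best = (directive, pattern)
    return best is None or best[0].lower() == "allow"


# Hypothetical rules, purely for illustration
SAMPLE_RULES = [
    ("Disallow", "/private/"),
    ("Allow", "/private/public/"),
    ("Disallow", "/*.pdf$"),
]
```

With these sample rules, `/private/public/page.html` is crawlable because the Allow pattern is longer than the Disallow one, while `/docs/report.pdf` is blocked and `/docs/report.pdf?x=1` is not, because the `$` only matches at the very end of the URL.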

For more, the SEOBook.com Robots.txt Tutorial is the best “from beginning to reference charts” introduction I found while catching up my knowledge.

P.S. While not specifically a robots.txt thing, I learned that Google will honour an `hreflang` attribute on <link> tags and Link headers and it’s always a good thing to give GoogleBot more information to make informed crawling decisions with.

Posted in Web Wandering & Opinion | Leave a comment

Checking if daemons have been restarted

TL;DR: Use this script.

I’ve been working on an Ansible script to set up web hosting (because cheap VPSes are cheaper than cheap shared hosting for the features I want) and, since I don’t like having to put in effort for maintenance and debugging, I’ve been trying to make it as robust as possible.

As part of that, I wrote a little Python helper script (tested under 2.7 and 3.3) which checks that all processes matching a given name were started after the modification date of a given config file. (In other words, it checks if the config changes have been applied.)

I just thought I’d share it in case anyone else finds it useful. Unlike many other approaches I’ve seen for looking up when a process was started, it doesn’t read the system time twice to convert from “seconds since boot” to “seconds since the epoch”, so it should give perfectly deterministic results.
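
As a rough illustration of the idea (this is a sketch, not the linked script, and the function names are invented for the example): on Linux, `/proc/stat` exposes the boot time as seconds since the epoch in its `btime` line, and field 22 of `/proc/<pid>/stat` gives the process start time in clock ticks since boot. Adding the two gives an absolute start time without ever sampling the current time.

```python
import os


def parse_boot_time(proc_stat_text):
    """Return the boot time (seconds since the epoch) from /proc/stat text."""
    for line in proc_stat_text.splitlines():
        if line.startswith("btime "):
            return int(line.split()[1])
    raise ValueError("no btime line found")


def parse_start_ticks(pid_stat_text):
    """Return the start time (clock ticks since boot) from /proc/<pid>/stat text."""
    # Field 2 (the command name) may contain spaces or parentheses,
    # so only trust what comes after the LAST closing parenthesis.
    fields = pid_stat_text.rsplit(")", 1)[1].split()
    return int(fields[19])  # fields[0] is field 3, so field 22 is index 19


def process_start_time(pid):
    """Return the start time of `pid` as seconds since the epoch (Linux only)."""
    with open("/proc/stat") as handle:
        boot_time = parse_boot_time(handle.read())
    with open("/proc/%d/stat" % pid) as handle:
        ticks = parse_start_ticks(handle.read())
    return boot_time + ticks / float(os.sysconf("SC_CLK_TCK"))


def started_after_file(pid, path):
    """True if `pid` was started after `path` was last modified."""
    return process_start_time(pid) > os.path.getmtime(path)
```

Because both inputs are fixed values recorded by the kernel, repeated calls for the same process return the same answer.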

Posted in Geek Stuff | Leave a comment

Watching For Changes in Window Focus Under X11

TL;DR: Here’s some example code. (backup copy)

UPDATE: It now also has the code to watch for the active window’s title changing without the window having to lose and regain focus.

I needed to explain to someone the proper way to watch a Linux desktop for changes to the active window (i.e. when the user focuses a new window) and, since I had a surprising amount of trouble finding a reliable way to do this, I thought I’d blog about it.

The first thing you’re likely to see when you go looking for this is people polling commands like xprop to get that information. I think everyone knows this is a method of last resort. All the people doing it seem to.

Option two, once you compose the right search queries, is to watch FocusOut and possibly FocusIn events across your whole desktop. Some people seem to get that working but, unfortunately, it was very erratic on my KDE 4.x desktop.

Finally, the solution that’s both simple (by X11 standards) and seems to work reliably for me. I’d like to thank @alanc on StackOverflow for drawing my attention to this: You can opt into property-change notifications on the root window and watch _NET_ACTIVE_WINDOW. (Which is a standard property maintained by any modern window manager)

You will still want to do your own change checking, though. Watching the property is good for minimizing resource consumption, but I did receive duplicate events.
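
For readers who don’t want to click through, here is a condensed sketch of the approach. It assumes the third-party python-xlib package and is a simplified illustration rather than a copy of the linked example; the helper names are made up. The duplicate filtering is kept separate so it is easy to see on its own.

```python
def changed_values(values):
    """Yield each value only when it differs from the one before it."""
    last = object()  # sentinel that never equals a real value
    for value in values:
        if value != last:
            last = value
            yield value


def active_window_ids():
    """Yield the active window's ID every time _NET_ACTIVE_WINDOW changes."""
    from Xlib import X, display  # third-party: python-xlib

    disp = display.Display()
    root = disp.screen().root
    net_active_window = disp.intern_atom("_NET_ACTIVE_WINDOW")

    # Opt into property-change notifications on the root window
    root.change_attributes(event_mask=X.PropertyChangeMask)

    while True:
        event = disp.next_event()
        if event.type != X.PropertyNotify or event.atom != net_active_window:
            continue
        prop = root.get_full_property(net_active_window, X.AnyPropertyType)
        if prop is not None and len(prop.value):
            yield prop.value[0]
        else:
            yield None


# Usage (needs a running X session):
#     for window_id in changed_values(active_window_ids()):
#         print("Active window changed: %r" % window_id)
```

Wrapping the event stream in `changed_values()` is what suppresses the duplicate events mentioned above.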

In addition to GitHub Gist, I’ve also posted the example in several StackOverflow answers (such as this one) in the hope that others will have a much easier time of it than I did.

Posted in Geek Stuff | Leave a comment

Making MPV EDL files double-clickable on Linux

After using OpenCV to skip post-roll ads, I wanted to share the relief with family who don’t launch their video players from the command line, so I researched how to associate the resulting .mpv.edl files with MPV so they could be double-clicked.

I won’t babble on. Here’s the script (including links to the reference material I used), which also associates any file which has “# mpv EDL” as the first bytes in the file.
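
The “first bytes” check is just content sniffing. The linked script leaves that to the desktop’s file-type machinery, but as a purely illustrative Python equivalent (the function name is hypothetical), it boils down to this:

```python
MPV_EDL_MAGIC = b"# mpv EDL"


def looks_like_mpv_edl(path):
    """True if the file at `path` begins with the mpv EDL header bytes."""
    with open(path, "rb") as handle:
        return handle.read(len(MPV_EDL_MAGIC)) == MPV_EDL_MAGIC
```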

Posted in Geek Stuff | Leave a comment

A Better Linkifying Regex

From time to time, I run across situations where the linkifying Greasemonkey script I use mistakenly includes a closing parenthesis in what it considers to be a URL.

Given that I can’t remember a single situation where I needed to linkify a URL with nested unescaped parentheses but URLs inside parentheses have bitten me repeatedly, I decided to solve the problem in a way that’ll work with any regex grammar.

const urlRegex = /\b((ht|f)tps?:\/\/[^\s+\"\<\>()]+(\([^\s+\"\<\>()]*\)[^\s+\"\<\>()]*)*)/ig;

Basically, it matches:

  1. http, https, ftp, or ftps followed by ://
  2. an alternating sequence of “stuff” and balanced pairs of parentheses containing “stuff”

…where “stuff” refers to a sequence of zero or more non-whitespace, non-parenthesis characters (and, in this linkify.user.js version, non-plus, non-angle-bracket, and non-double-quote too).

Embarrassingly, aside from two corrections and a few extra characters in the blacklists that I kept from the original linkify.user.js regex, this is a direct translation of something I wrote for http://ssokolow.com/scripts/ years ago… I’d just never remembered the problem in a situation where I could spare the time and willpower to do something about it.

Here’s the corrected Python original.

hyperlinkable_url_re = re.compile(r"""((?:ht|f)tps?://[^\s()]+(?:\([^\s()]*\)[^\s()]*)*)""", re.IGNORECASE | re.UNICODE)

The corrections made were:

  1. Allow the pairs of literal parentheses to be empty
  2. Move a grouping parenthesis so that a(b)c(d)e will be matched as readily as a(b)(c)d.

Markdown source code works especially well to demonstrate the difference.

Naive Linkifying Regex: [FreeDOS](http://freedos.org).
My Linkifying Regex:    [FreeDOS](http://freedos.org).
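
To see what each approach actually captures, here is a small Python comparison. The second pattern is the corrected Python regex from above; the “naive” pattern is only a stand-in I’m assuming for illustration (scheme followed by any run of non-whitespace), not the regex from any particular script.

```python
import re

# Stand-in for a typical naive linkifier (an assumption for this demo)
naive_url_re = re.compile(r"(?:ht|f)tps?://\S+", re.IGNORECASE)

# The corrected Python regex from this post
hyperlinkable_url_re = re.compile(
    r"""((?:ht|f)tps?://[^\s()]+(?:\([^\s()]*\)[^\s()]*)*)""",
    re.IGNORECASE | re.UNICODE)


def find_urls(regex, text):
    """Return every URL the given regex finds in `text`."""
    return [match.group(0) for match in regex.finditer(text)]


markdown = "[FreeDOS](http://freedos.org)."
print(find_urls(naive_url_re, markdown))
# → ['http://freedos.org).']
print(find_urls(hyperlinkable_url_re, markdown))
# → ['http://freedos.org']
```

URLs that contain balanced parentheses of their own, such as `http://example.com/a(b)c(d)e`, are still matched in full even when the whole URL sits inside a parenthetical.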

Theoretically, look-ahead/behind assertions are enough of an extension to regexp syntax to allow real HTML parsing, so I could probably also support nested parens, but I’m just not in the mood to self-nerd-snipe right now.

Posted in Geek Stuff | Leave a comment

Mixed Feelings on Cloanto and Amiga/C64 Forever

UPDATE: I’ve received a response from Cloanto and, after talking to a real human about this, I’m convinced that this is mostly, if not entirely, a pile of unfortunate mistakes that they sincerely want to get fixed. I’ve added notes to clarify things.

As someone who prefers to take the high ground, when I was offered the opportunity to get Amiga Forever and C64 Forever at a big discount, I jumped at it. My first PC was an original IBM PC and I’d missed out on those famous platforms entirely… here was a chance to get into them without compromising my principles.

I also loved how, as a Linux user, Cloanto seems to be walking a balance with their cross-platform support page… admitting that their digital download releases are MSI installers, but providing what I’ll call “relaxed support” for other platforms from CD/DVD versions and clarifying that users have confirmed the ability to generate them from the MSIs using Wine or equivalent.

However, after deciding to purchase, I noticed that their website design wasn’t the only thing that, to be kind, felt a bit dated.

First, their purchase process. Is it really necessary to ask users for their shipping address if they’ve selected a digital download item and PayPal payment? I could easily see that driving away some on-the-fence buyers who value their privacy.

UPDATE: We’re still talking, but this looks to me like one of those “their merchant services provider doesn’t understand this market segment” issues… they’re already working on a new site design to help remedy that problem as much as possible.

Second, the post-purchase e-mails. Where do I start?

  1. Is it really necessary for me to receive seven e-mails in response to a successfully completed transaction?
  2. What’s the point in sending me an e-mail, just to tell me to log into my e-mail account to follow the instructions in the e-mail I’m about to receive? (No joke.) I’m not going to see it until I’ve done what it’s telling me to do!
  3. An anti-fraud measure involving them asking me to confirm my PayPal e-mail? What’s wrong with just asking PayPal if I’m a verified user? (Oh well, if Cloanto or Avangate start spamming me, I can just move PayPal to a new alias and delete the old one.)
  4. Is it really necessary to send three different confirmation e-mails for different stages of the process, rather than just waiting a couple of seconds and sending one combined e-mail?

Oh well… on to the next problem.

UPDATE: The seven e-mails are all from their merchant services provider (Avangate) and I received no argument on this side that it’s excessive. They’ve passed on my concerns via the B2B communication channels available to them.

Third, the registration keys.

As soon as I saw those, I immediately worried that maybe Amiga Forever and C64 Forever were online-activated products and the installers would stop working if Cloanto went out of business.

(Thankfully, the “Forever” in the title does appear to be accurate, as they installed without complaint on the quarantined Windows XP retro-PC that does double-duty as an online activation tester. No need to demand a refund.)

UPDATE: We’re still talking, but I’ve suggested, at minimum, that an explanation of the key’s purpose (unlocking the paid content in a multi-role offline install package) be provided either with or before showing the keys. I also made suggestions for a longer-term strategy.

Fourth, when you’re selling a game-related product in the era of services like Steam and GOG, this can easily trigger buyer’s remorse:

You have 50 downloads remaining.
Link expires on: November 16, 2017.

There are many reasons this is a problem:

  1. This is retro-emulation stuff with the bulk of it being more than 20 years old. It’s already easy enough and tempting enough for people to pirate it without adding an expiry message so they can’t rationalize it as paying for a booklet of 50 off-site backup coupons.
  2. In the era of cheap hosting like Amazon S3 and “re-download is better than backup” services like Steam/Origin/uPlay/GOG/etc., is it really necessary to make people with flaky connections worry about whether their download manager’s resume feature will chew up most of their redownloads?
  3. The installer acts as if the same download is offered for both trial and paid copies, depending on whether you enter a registration code. Again, why am I made to agonize over a redownload limit and expiry counter on this thing?

All in all, this screams “Danger, Will Robinson! Danger!” because this kind of out-of-touchness makes me worry about whether they’ll remain competitive enough in the market to avoid going under.

*sigh* Ok, I’ve paid for the damn thing and, as much as it hurts, I’ve already spent far too much money on taking the moral high ground for other platforms (eg. I use a Retrode and buy actual cartridges, rather than being locked into Nintendo’s Virtual Console.). What’s next?

UPDATE: They’ve actually been trying to get Avangate to understand this for a while and providing their own accounts system to resolve this is part of the reason they’re working on a new site design.

Fifth, the download speed.

An average download speed of 80KiB/s because it gives me spikes of full speed alternated with several seconds of nothing… ’nuff said.

UPDATE: Avangate serves the files.

Sixth, the license agreement.

  1. Make up your mind. The website I can see before I pay seems friendly and willing to allow me to install on any platform I have the know-how for, but the license says that installing it on any platform other than Windows, MacOS, or GNU/Linux (eg. FreeBSD) will terminate my license.
  2. I can only install it on two machines? Dammit, I forgot to pay attention to whether that “wait at least 6 months” rule was only for the evaluation version.
    Does my “moral high ground” rule mean that I can’t install it on both my Linux desktop and my Linux handheld until 6 months after I remove it from the XP machine I used for testing?

UPDATE: I had to clarify my concerns in my response. I’m waiting for a reply.

Seventh, the RP9 files.

Not strictly Cloanto’s fault, but there are no Google results which mention that you can get more broadly compatible disk images from an RP9 using 7-Zip. I just figured it out by accident when I right-clicked one on the test machine.

UPDATE: I’ve suggested some minor adjustments to the knowledge base page which shows up in Google and/or the “RP9 Toolbox” software to draw more attention to the link to the RP9 spec which I missed.

Eighth, the games themselves.

Ok, so, dude, I bought this pack because I want to stay legal. Ya dig? …so why am I seeing a cracking group intro when I fire up B.C.’s Quest for Tires on the C64?

I seriously doubt the rightsholder for the game got permission to use the cracking group’s intellectual property and just because it’s an unauthorized derivative work doesn’t magically cause the rights to be forfeit.

I’m now stuck in one of those BS situations where I’m only “legal” because the guys I sided with have the bigger stick, not because they’re actually in the moral right.

…so, what did I pay for then? Kickstart ROMs and disk images that would have fallen into the public domain by now if Copyright hadn’t become corrupted and the warm, fuzzy feeling of having a slightly lighter wallet?

I’m really starting to understand why the GOG.com user base considers Cloanto to be at fault for GOG.com failing to negotiate a deal for the Kickstart ROMs so they could include Amiga games in their catalogue.

I’d say “I give up”, but that might be taken as “I’m going to start pirating” when, really, it just means that I’m probably going to buy fewer retro-games. I already have

They wonder why people pirate things when, even if you spend hours and tie yourself in knots trying to stay compliant with the letter of copyright law, your upstream suppliers are unilaterally deciding that a cracking group’s IP deserves no protection because it’s an unauthorized derivative work. It’s simply flat-out impossible to enjoy early cultural artifacts in the world of gaming and retain the moral high ground in a world of bit-rotting floppy discs. 🙁

UPDATE: They’ve actually brought this “We got the rights to the games, but what about the copyright on the code the crackers wrote?” issue up with the US Copyright Office multiple times.

Also, on the “GOG failed to negotiate a deal” front, Cloanto is apparently aiming to eventually get the Amiga/C64 IP to the point where it can be spun off as a non-profit… it’s just not as simple as I make it sound.

Finally, the convenience (or lack thereof).

I can only assume that Cloanto is mostly trying to compete with pirates based on convenience (like Steam does quite well)… but does this seem convenient to you?

  1. Install both things on my quarantined XP machine so I can be sure they’re not phoning home.
  2. Ask both tools to generate the promised ISO versions (and printable covers, since I’m going to this effort anyway) because using p7zip to unpack the installers on my Linux desktop without running them produces unhelpful filenames.
  3. Put everything including the ISOs, my purchase invoice, and a text file containing the registration keys into another DVD ISO so I know everything can be kept together nicely.
  4. Run all three of the aforementioned ISOs (official C64, official Amiga, combined backup) through dvdisaster to augment the raw ISO filesystem with forward error correction in case the discs start to bit-rot after my download links have expired.
  5. Burn all three to discs from the stockpile of Taiyo Yuden T02 DVD+R media that I use for archival (which, by the way, they no longer make).
  6. Write the order number and my name on all three discs so that they won’t look pirated if Cloanto goes out of business and their records become unavailable.
  7. Write the registration keys on the official media, since they won’t have them in the burned data.

UPDATE: Already addressed as a side-effect of addressing the earlier concerns.

…and no, I couldn’t just rely on pirated copies as my off-site backup. Those bits have the wrong colour.

Posted in Geek Stuff | Leave a comment

Simple Alarm Clock Script For Linux

TL;DR: Install python-dateutil, pytimeparse, and this script, then see the --help output for more details.

For a while, I’d been using the at command to schedule alarms when I needed to wake up in the morning, but I found that it was a fragile solution because of how MPlayer and its descendants interacted with PulseAudio’s session-centric setup and the presence or absence of a video output.

…and you really don’t want a fragile solution for your alarm clock, so I decided to write a little helper script that could run inside my quake-style terminal in my user session so it would Just Work™.

You’ll want to edit the hard-coded media player command it uses to actually play the alarm, but, otherwise, it should be pretty polished for something I just hacked together for my own use.

It’ll accept arguments in two forms:

  • wakeme at 6am
  • wakeme in 3 hours

(It accepts a great many formats for times and durations, so I’ll just point you at the docs for dateutil.parser.parse() (times) and pytimeparse (durations) for the complete list.)

Either one will cause it to echo back its interpretation of what you asked for (so you can double-check that it understood properly) and then sleep until it’s time to wake you.
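
The core of that logic is small enough to sketch here. This is an illustration rather than the script itself: it leans on the same two third-party libraries, the function names are invented, and rolling an “at” time that has already passed today over to tomorrow is my assumption about the sensible behaviour.

```python
import datetime


def roll_forward(target, now):
    """If `target` is not in the future, assume the user meant tomorrow."""
    if target <= now:
        target += datetime.timedelta(days=1)
    return target


def resolve_alarm(mode, text, now=None):
    """Turn ('at', '6am') or ('in', '3 hours') into a datetime to wake at."""
    now = now or datetime.datetime.now()
    if mode == "in":
        from pytimeparse.timeparse import timeparse  # third-party
        seconds = timeparse(text)
        if seconds is None:
            raise ValueError("could not parse duration: %r" % text)
        return now + datetime.timedelta(seconds=seconds)
    if mode == "at":
        from dateutil.parser import parse  # third-party: python-dateutil
        return roll_forward(parse(text), now)
    raise ValueError("expected 'at' or 'in', got %r" % mode)
```

Echoing the resolved datetime back before sleeping is what lets you catch a misparsed time while you’re still awake.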

Installation is as simple as:

  1. Make sure Python 2.x is installed (I haven’t tested 3.x)
  2. Install python-dateutil
  3. Install pytimeparse
  4. Put wakeme in your PATH

Posted in Geek Stuff | Leave a comment