How to skip the fortune command when your shell is slow to start

A.K.A. How to get and compare timestamps without external commands in shell script (and without even invoking subshells in Zsh)

I love the fortune command. It’s a charming little addition to each new tab I open… until something (like a nightly backup) has blown away my disk cache or a runaway memory leak is causing thrashing. Then, it’s just a big delay in getting to what I want to do.

The obvious solution to any non-shell programmer is to time everything and invoke fortune only if it’s not already taking too long, but shell script complicates that by having so few builtins. We don’t want to invoke an external process, because that would defeat the point of making fortune conditional, and we don’t want to invoke a subshell because, if we’re thrashing because of memory contention, that’ll also make things worse.

It turns out that bash 4.2 and above can get us half-way there by using a subshell to invoke the printf builtin with the %(%s)T token, but Zsh has a clever little solution that even reuses code that we’re going to need anyway: prompt substitutions!

Here’s the gist of how to pull it off:

# Top of .zshrc
local start_time="${(%):-"%D{%s}"}"

# -- Do all my .zshrc stuff here

local end_time="${(%):-"%D{%s}"}"
if (( end_time - start_time < 2 )); then
    if (( $+commands[fortune] )); then
        fortune
    fi
else
    echo "Skipping fortune (slow startup)"
fi

This is a standard “subtract start time from end time to get how long it took, then compare it to a threshold” check, so the only part that should need to be changed in bash is the timestamp capture, using something like start_time="$(printf "%(%s)T" -1)" (the explicit -1 argument means “now” and is required before bash 4.3). Instead, let’s pick apart how the Zsh version works:

  1. We start with a bog-standard ${VAR:-DEFAULT} parameter expansion. However, unlike bash, Zsh considers ${:-always default} (an empty variable name) to be valid syntax.
  2. The (%) on the left-hand side is a special magic flag, similar to the (?i) syntax used for inline flag-setting in some regular expression engines. It enables prompt expansion of both the (nonexistent) variable’s contents and the fallback value.
  3. %D{...} is Zsh’s prompt expansion placeholder for putting strftime (man strftime(3)) timestamps into your prompt.
  4. %s is the strftime token for “seconds since the epoch”.
  5. You have to quote the %D{...} or the ${...} consumes the closing curly brace too eagerly.

That’s the big magic thing. A way to write an equivalent to time(2) from the C standard library in pure Zsh script with no use of $(...) or zmodload and, since we’re using prompt expansion to do it, the only thing we might not already have needed to load into memory is the code for the %D{...} expansion token.

(Unfortunately, there’s no way to get sub-second precision with this approach, so the only two useful threshold values for a well-optimized zshrc are probably “1 second” and “2 seconds”.)
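For completeness, here’s a sketch of what the bash side of the trick might look like. This assumes bash 4.2 or newer; printf -v (which assigns straight to a variable, dodging even the subshell) and the %(fmt)T specifier are real bash features, but the threshold and the fortune call just mirror the Zsh example above rather than any tested .bashrc of mine:

```shell
# Top of .bashrc (bash 4.2+). "printf -v" assigns to the named variable
# without spawning a subshell; the explicit -1 argument means "now" and
# is required before bash 4.3.
printf -v start_time '%(%s)T' -1

# -- Do all my .bashrc stuff here

printf -v end_time '%(%s)T' -1
if (( end_time - start_time < 2 )); then
    if command -v fortune >/dev/null 2>&1; then
        fortune
    fi
else
    echo "Skipping fortune (slow startup)"
fi
```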

Now for that odd (( $+commands[fortune] )) way of checking for the presence of the fortune command. What’s up with that?

Well, it’s actually a micro-optimization that I use in my zsh-specific scripts. According to this guy’s tests, it runs in half the time the other options take and, in my own tests using his test scripts, I found that, depending on the circumstances, it could take as little as one tenth the time of the others, and that the others vary wildly relative to each other. (On runs where $+commands is 7 to 10 times as fast as type and which, hash is sometimes twice as fast as type or which and sometimes half as fast.)

Normally, this would be a moot point because any of the portable ways of checking for the existence of a command via a subshell and a builtin would be far too quick for it to matter (ie. I do it just for the heck of it) but, in this case, it felt appropriate.
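For reference, the portable spellings look like this. `command -v` is the one POSIX actually guarantees; the `have` wrapper name is just my own shorthand, not a builtin:

```shell
# POSIX-portable equivalents of zsh's (( $+commands[fortune] )) check:
if command -v fortune >/dev/null 2>&1; then
    fortune
fi

# Or wrapped for reuse ("have" is an arbitrary name):
have() { command -v "$1" >/dev/null 2>&1; }
have sh && echo "sh found"
```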

(Another unnecessary micro-optimization that I didn’t use here was preferring [[ ]] over [ ] in my zshrc scripts. My tests found a million runs of [[ "$PWD" == "$HOME" ]] to take about 1.4 seconds, while a million runs of [ "$PWD" = "$HOME" ] took about 4.2 seconds.)

Posted in Geek Stuff | Leave a comment

On-Demand Loading for your .zshrc or .bashrc

Recently, I’ve been trying to make my coding environment snappier, and one thing I was never happy with was how slow my .zshrc is.

Now, don’t get me wrong, I’m not one of those people using oh-my-zsh with a ton of plugins and seeing 15-second waits for my shell to start… but I do want a new tab to be ready in a second or less.

So, I slapped zmodload zsh/zprof onto the top of my .zshrc, opened a new tab, and ran zprof | less …and 50% of the wait was in sourcing virtualenvwrapper, which I don’t feel like reinventing.

Time to take a lesson from the improvements I’ve been making to my .vimrc. Specifically, the { 'on': ['CommandA', 'CommandB'] } option hanging off the end of various lines for my plugin loader.

A little experimentation later and I came up with this construct:

function init_virtualenvwrapper {
    # Don't do anything if it's already loaded
    type virtualenvwrapper_workon_help &>/dev/null && return

    # ------------------------------------------------
    # normal stuff to load virtualenvwrapper goes here
    # ------------------------------------------------
}

for cmd in workon mkproject mkvirtualenv; do
    function $cmd {
        unset -f "$0"
        init_virtualenvwrapper
        "$0" "$@"
    }
done

For those not familiar with shell scripting, I’ll clarify.

For each shell function or command that I want to trigger deferred loading, I create a function with the same name that does the following:

  1. “Delete” itself, so it won’t shadow what virtualenvwrapper is about to set up. (You want to do this first to avoid removing what virtualenvwrapper just created.)
  2. Call init_virtualenvwrapper to load the real command. (It starts by checking for some side-effect of having been run before and returns early if that’s the case. This keeps mkproject from re-doing what workon already did, or vice-versa.)
  3. Call the actual command and pass through any arguments.

Doing this means that:

  1. Your .zshrc or .bashrc startup time only pays the price for declaring a few shell functions. (And, if that gets too heavy for some reason, you could move init_virtualenvwrapper into another file and source it on demand.)
  2. Your first call to a wrapped command like workon will take longer. (eg. if it was adding two seconds to your shell start time, then your first call to it will take two seconds longer.)
  3. Subsequent calls to that or any other command sharing the same init_virtualenvwrapper will be as quick as usual.

Unfortunately, this design is actually Zsh-specific, which sucks for me because this is a file I share between .zshrc and .bashrc:

  1. Bash doesn’t support using a variable for a function name, so you can’t use a for loop. You’ll just get `$cmd': not a valid identifier.
  2. In my testing, functions didn’t set $0 in bash, so this will actually execute bash "$@", bringing you back to where you started, while zsh doesn’t set the FUNCNAME array variable that bash uses.
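You can see the difference with a three-line experiment. This prints the function’s own name under both shells, which is why the combined "${FUNCNAME[0]:-$0}" spelling works:

```shell
# bash sets the FUNCNAME array inside functions (and leaves $0 as the
# script/shell name); zsh leaves FUNCNAME unset but points $0 at the
# function's own name. Falling back from one to the other covers both.
whereami() {
    echo "${FUNCNAME[0]:-$0}"
}
whereami   # prints "whereami" in both bash and zsh
```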

So, if you want to support both, here’s the most concise form I was able to put together:

function init_virtualenvwrapper {
    local _cmdname="$1"
    shift  # so "$@" below holds only the real command's arguments
    unset -f "$_cmdname"

    # Don't do anything if it's already loaded
    if ! type virtualenvwrapper_workon_help &>/dev/null; then
        # ----------------------------------------
        # normal stuff to load virtualenvwrapper
        # ----------------------------------------
        :  # placeholder so the empty if-body stays syntactically valid
    fi

    "$_cmdname" "$@"
}

function workon {
    init_virtualenvwrapper "${FUNCNAME[0]:-$0}" "$@"
}
function mkproject {
    init_virtualenvwrapper "${FUNCNAME[0]:-$0}" "$@"
}
function mkvirtualenv {
    init_virtualenvwrapper "${FUNCNAME[0]:-$0}" "$@"
}

Anyway, I hope this helps to inspire anyone else who’s suffering from slow shell startup times.

UPDATE: And now, shortly after writing that, I discover that someone else went to the trouble of using eval to provide a nice API on top of this trick and put it up on GitHub as sandboxd. From that name, I can see why I didn’t find it before.

Posted in Geek Stuff | Leave a comment

So… What Does Your Government Do With Culture?

So often, asking people about culture is like asking a fish “How’s the water?” (The answer you’ll get is “What’s water?”) but it’s still useful to ask the question and, sometimes, you get interesting answers.

This time I’m wondering about ways your country’s government promoted the enrichment of culture and I think the best way to jog people’s memories is to give a bunch of examples of the kind of thing I’m talking about.

Everyone’s at least heard of government grant programs in the abstract (and Canada does have those. They’ve been instrumental in the creation of indie games I love, like Guacamelee, and shows like Mayday, a long-running docudrama series nominated for many awards which, unlike so many American ones, remembers that air crash investigations are detective stories first and human drama second)

Can you think of any “thanks to” credits for other government agencies or programs that show up in the credits of your favourite shows or on the websites of your favourite games?

…but, still, that’s kind of an obvious way to do it. What about stuff that’s less overtly “government promoting culture”?

Next down the progression of obviousness, there’s public broadcasting. Like the U.K., Canada has a public broadcaster (the CBC). It does produce excellent content of its own, such as the radio programs Quirks & Quarks and Because News (also available as podcasts), and it has adapted well to the Internet era (in addition to podcasts and the like, they also do print articles now), but it was actually ahead of its time. For quite a while before YouTube came around, you used to be able to watch complete archives of shows like Royal Canadian Air Farce on the CBC website in RealVideo format.

The U.K. actually has more than one public broadcaster. For everyone who knows about the BBC, how many of you know that Channel 4 (of Time Team fame) is also government-owned?

… but encountering your public broadcaster while channel-surfing is still too obvious. Let’s go deeper.

For example, since 1961, the CBC has been part of a partnership with House of Anansi Press and the University of Toronto to produce the Massey Lectures. If you like TED Talks, check them out. (I especially recommend Doris Lessing’s Prisons We Choose To Live Inside from 1985. It’s an amazing talk about human psychology that’s more relevant than ever, you can listen to it online for free, and, to my embarrassment, I didn’t know about it until the print version was assigned to me as reading in university.)

Can you think of anything your government contributes resources to along these lines? Recurring cultural events?

…how about PSAs that go beyond just being practical and help to spread culture? When I was a child, I don’t remember CBC television having commercials… though it’s possible they just had a reduced supply of them. The important thing is, their shows were formatted to leave room for a normal number of commercials. …so how did they fill that time?

Some other channels, such as the Family Channel (a kids channel which used to be commercial-free), filled the time with random pop music videos, but CBC did something a little more appropriate… they filled the time with shorts provided by other government-backed cultural enterprises like the National Film Board of Canada, and Canadian Heritage Minutes. Anyone who grew up in Canada is likely to fondly remember these things, so I’d say it was hugely effective.

Like so many kids, I forgot most of what I learned in history class, but I still remember about the amusing origin of the name Canada, the Halifax explosion, and the origin of Winnie the Pooh.

Likewise, what kid would know about Wade Hemsworth’s music if not for classic animated shorts like The Log Driver’s Waltz and The Blackfly Song produced by The National Film Board of Canada? (Not to mention the classic cartoon version of The Cat Came Back?)

Can you think of anything this engaging that your government actively produced or did they stick to purely functional pieces like Duck and Cover? (I’ll also accept stuff that isn’t distinctive to your local culture, but demonstrates that PSAs can be entertainment in their own right, such as Australia’s Dumb Ways to Die.)

Anyway, now we get to the stuff you take most for granted.

When I was a kid and I occasionally saw American money, I’d think “Huh. American money is ugly.” Much later, my father brought home a bunch of European coins. To my surprise, it turns out that it’s not that American money is ugly… it’s that Canadian money is uncommonly artistic. Of all the money I saw, the only other country with comparably beautiful coinage was Ireland. Don’t believe me? Scroll down to the pictures on these pages. (Note: I linked to pre-Euro Irish currency because that’s what I saw.)

The U.S., the U.K., Belgium, France, Switzerland, Germany… every other coin I saw had some boring piece of patriotic imagery or maybe a coat of arms, while Canadian and Irish coins were beautiful expressions of the culture of the nation in question. (Don’t believe me? Here’s the Canadian 50 dollar bill from 2004… that imagery commemorating The Famous Five wasn’t a special commemorative bill. That was the normal fifty.)

It makes it look as if all those cultures have deep insecurities, so they’re “compensating for something” with their patriotic imagery. It’s such a given that I love my country that, for most of my life, I never understood the point of putting up a Canadian flag on a non-government property. Why plaster the same patriotic imagery everywhere like graffiti when you can instead be expressing yourself, either by creating your own art or by displaying other people’s art which speaks to your sense of aesthetics?

…but enough of that tangent. Another example would be the Canadian flag. No tiny details like on Mexico’s flag, but still with more artistry recognizable to the common person than all those flags made of coloured rectangles and/or stars. (I’m not singling the U.S. out here. Look at France, the U.K., Russia, and countless other countries.)

It just seems like countries get boring when the topic of government art comes around. The only other flags that readily come to mind as having that kind of elegance are the Japanese and South Korean flags and the fern iconography that showed up in the New Zealand flag referendums.

…or, for that matter, look at the elegance of the T-130 wordmark used on official Canadian government documents and signage, or the equally elegant T-605 primary identification signs.

If you’re going to see something in so many places, why is it such a rare idea for government decision makers to grasp that it should be as aesthetically satisfying as possible?

It took me years to notice these things, so I’m really curious to see examples where Canada is the boring one and I’m just taking it for granted that thing X and thing Y are boring. (Does anyone have a really interesting national anthem? Canada’s seems to be just as boring as the American and Australian ones.)

Posted in Web Wandering & Opinion | Leave a comment

Forcing Firefox to Open CBZ Files Properly

If you’ve ever downloaded a .cbz file using Firefox and then tried to click it in the downloads panel, you might have noticed that Firefox ignores the association for the .cbz extension and instead opens it as a .zip file. (This isn’t the only situation where this happens and I filed a bug about it a year ago.)

I never got around to looking into why it doesn’t make the same mistake with .odt documents, which are also Zip files with a specialized extension, but I think you can see why I wouldn’t like it.

Here’s a quick little script that can be set as the Zip handler on a KDE-based desktop which will hand over to Ark under normal circumstances but, if it receives a file with a .cbz extension, will pop up a Yes/No dialog offering to open in Comix instead.

Anything which obeys the system’s associations properly will never trigger it, because .cbz files will be associated with Comix and it won’t pop up the dialog if fed a .zip file, so it’s about the best solution one can have without fixing Firefox.

This approach could also be easily extended to application/octet-stream to work around the other situation where I’ve seen this causing problems. (Patreon serving up image files with the wrong mimetype, if I remember correctly.)

Posted in Geek Stuff | Leave a comment

Displaying An Image or Animated GIF in Qt With Aspect Ratio-Preserving Scaling

When it comes to me and organizing images, GQView (now Geeqie) has always been a “best of a bunch of bad options” sort of thing and, with my move off Kubuntu 14.04 LTS, it’s become downright unusable in some cases. (eg. Freezing up at 100% CPU for several minutes to load certain collections)

As a result, I’ve been pushed to prioritize my efforts to replace at least the bare minimum subset of that functionality and, since I don’t want to rely on gtk3-mushrooms to make my own creations tolerable to me and Rust doesn’t have mature Qt bindings, that means PyQt5.

It’s not perfect (Qt doesn’t have incremental loading like GdkPixbufLoader, so I have to rely more heavily on my prototype code for asynchronously loading a bunch of upcoming images ahead in the background while I dawdle looking at the current one) but it’ll have to do… and I’ve filed a bug about that.

Now Qt has always been weird about how to get a displayed image to preserve its aspect ratio properly. It’s probably the one really glaring oversight in an otherwise very nicely designed and documented set of APIs. Given how much I had to fiddle around with things, I decided that I definitely wanted to share what I came up with.

What made it more difficult is that I’ve always wanted a GQView-alike which also displays animated GIFs with their animation, and Qt doesn’t have a unified solution for that. (QImage handles static images and QMovie handles GIF and MNG, but not actual movies, which you need to use the multimedia backend for.)

It turns out that getting smooth upscaling with QMovie is a tricky thing in itself because it’s very easy to accidentally build a widget tree that does the upscaling at a point in the pipeline where fast/ugly upscaling gets used, so a big thanks to Spencer on StackOverflow who figured it out.

Anyway, enough talk. Here’s the code:

(Yeah. I was too eager to post it, so this prototype hasn’t actually been split into the design which allows me to put the cache after the images get decoded. Still, it should be useful for most people.)

Posted in Geek Stuff | Leave a comment

On Dehumanization In Fiction

I have to admit it… I have a lot of drafts kicking around in my notes which most people would consider to be perfectly good blog posts but which, for me, were just flashes of inspiration that I wrote down to avoid losing them, but I never felt were “finished”.

While I was looking through the snips I’m accumulating for a book on writing, I rediscovered a couple which, looking at them now, are good enough to share, even if I still feel that there’s more insight to be teased out and more room for the style to be polished.

Dehumanization is at the heart of some of the most effective dark writing.

What hits harder than cruelty? Casual cruelty.
What hits harder than casual cruelty? Institutionalized cruelty.

Dropping a man in the wilderness, hundreds of miles from the nearest human, will cause hardship, but, if you want a man to despair, drop him into the heart of a big city, penniless, alone, and ignored by all who pass… and that’s just from cruelty by neglect.

Humans are social animals to our very core, and slavery is abhorrent precisely because it’s institutionalized cruelty at its most powerful… forcing the reader to not only observe active dehumanization on a mass scale, but to confront how flawed their optimistic preconceptions of human nature are in a way that rings too true for them to deny.

(Humanity’s social nature is also why solitary confinement is considered torture in many places, but this is much more difficult to communicate to someone who hasn’t experienced it personally.)

Looking at it from another angle, it’s also so powerful because of the specific kinds of emotions it evokes in the reader/viewer via their sense of empathy. It’s not just that the character is experiencing misery or defeat or isolation, it’s that their circumstances evoke a sense of despair AND powerlessness, futility AND hopelessness.

Most telling, I think, is how Chip Conley pseudo-mathematically expressed despair: suffering without meaning… and isn’t that also the perfect starting point for a definition of my own term, “Hardship Porn”. (Fiction where, through intent or incompetence, the author seems to revel in making their hero’s life miserable, not because it makes the writing more powerful but just to gratify some emotional need.)

I also made a related observation that slavery is powerful because it tends to involve two kinds of atrocities which fall under the other major class of violations we readily recognize: violations of the sanctity of self.

To wilfully and permanently disfigure someone’s body against their desires, or to attack their very psyche, is the most personal form of dehumanization possible… denying you control over the only things that are unarguably, undeniably, unquestionably your own and attacking your thoughts, the one hiding place nobody should ever intrude… let alone tamper with. It is no accident that, as a species who think in metaphor, we often refer to the body as a temple and the mind as a sanctum.

Posted in Writing | Leave a comment

How to Keep Humans From Seeing Your reCAPTCHA

I don’t know how many people know this, but reCAPTCHA is a major pain if you’ve configured your browser to prevent Google from doing things like setting tracking cookies or fingerprinting your <canvas>. Sometimes, it’ll take me a minute or more before the bleeping thing lets me through.

So, for my own sites, I’m very reluctant to make people fill out CAPTCHAs. (Plus, there’s also an aspect of “Is this what we’ve been reduced to? Taking for granted that we must constantly pester legitimate users to prove that they’re human because we’re letting the bad actors set the terms of engagement?”)

Note that I will not be covering the pile of techniques that require JavaScript to implement because, as a dedicated uMatrix user, I find those to also be annoying, though nowhere near as much as reCAPTCHA.

So, let’s think about this problem for a second. What can we do to improve things by reducing the need to display reCAPTCHA?

Well, first let’s think about the types of spam we’re going to receive. I’ve noticed two types, and I’ll start by addressing the kind CAPTCHAs don’t prevent:

Human-Sent Spam

Believe it or not, several times a year, I would receive spam that’s clearly been sent by a human, trying to promote some shady service they think I’ll want (typically SEO or paid traffic).

I tried putting up a message which clearly states that the contact form on this blog is not for this sort of message, but I still occasionally get someone who ignores it… so what more can be done?

Well, I can’t do it with my current WordPress plugin but, for my other sites, how about trying to make sure they actually read it, and making it sound scarier for them to ignore it?

The simplest way to do this is to add a checkbox that says something like “I hereby swear under penalty of perjury that this message is not intended to solicit customers for any form of commercial service” like I did for the GBIndex contact form.

Since you’re guarding against an actual human this time, using a normal browser, you don’t even need any server-side code. Just set required="required" in the checkbox’s markup and their browser will refuse to submit the form until they check the box, drawing their attention to it, which is exactly what we want.
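Concretely, the markup can be as small as this sketch (the wording is the example text from above; the field name is an illustrative pick of mine, and required="required" is standard HTML5):

```html
<label>
  <input type="checkbox" name="not_solicitation" required="required">
  I hereby swear under penalty of perjury that this message is not
  intended to solicit customers for any form of commercial service.
</label>
```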

Of course, you want it to be clear that it’s not toothless stock text, so there are two other things you should do:

  1. Don’t just copy-paste my phrasing. Identical text is only good in such a declaration if the readers associate consistency with “this has the force of law and has been tested in actual court cases” rather than “this is a stock snip of HTML from www.TopHTMLSnips.blort”
  2. Include a highly visible message somewhere on the page which makes it clear that, if they just blindly check the box, you’ll report whatever they’re promoting to their service providers (domain registrars, web hosts, etc.) for Terms of Service violations.

    (and do follow through. For example, use the global WHOIS database to identify the domain registrar, then use the registrar’s “Report Abuse” link in their site footer or support section. Then use the registrar’s WHOIS lookup service to identify the nameserver provider and use their “Report Abuse” link. If you think the hosting may be with a shared hosting provider different from the nameserver provider, you can use techniques like doing a DNS lookup on the domain, then reverse DNS lookups on the resulting IP addresses.)

You could also put a Bayesian filter to work on your inbox, but I’m always wary of false positives and don’t want to have to sift through a spam box periodically, so I try to avoid that… and this works well enough.

OK, so, with that out of the way, let’s get to what CAPTCHAs are meant to stop…

Bot-Sent Spam

There are two kinds of bot-sent spam. Stuff meant to be read by humans, and stuff meant to be read by machines. Since some of the techniques used for preventing machine-targeted spam also help to stem the tide of stuff aimed at humans, we’ll address those first.

In both cases, you can certainly apply a Bayesian filter but, as with human-sent spam, I aim for something more deterministic.

Machine-Readable Bot Spam

Machine-readable spam is spam intended to evoke a reaction from another machine. The most typical example of this is manipulating search results by scattering links to their garbage all over the web.

The key to combating machine-readable spam is recognizing that, if the target machine can understand the important characteristics of the message, so can your spam-prevention measures.

1. Block Link Markup

The first layer of protection I like to apply is to detect disallowed markup and present a human-readable message explaining what changes must be made for the message to be accepted.

For example, in my contact forms, which are going to be rendered as plaintext e-mails, the spam that gets submitted comes from bots that mistake them for blog comment fields, and 99% of that can be killed simply by disallowing </a>, [/url], and [/link] in messages, and instructing users to switch to bare URLs.
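As a sketch (the function name and rejection message are mine, not from any particular framework), the check can be as simple as a case-insensitive substring match:

```shell
# Reject messages containing the closing tags that comment spambots emit.
has_link_markup() {
    local msg
    msg="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
    case "$msg" in
        *'</a>'* | *'[/url]'* | *'[/link]'*) return 0 ;;
        *) return 1 ;;
    esac
}

if has_link_markup 'Buy pills <A href="http://spam.example/">here</A>'; then
    echo "Link markup is not accepted here; please use bare URLs instead."
fi
```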

This is mainly about making the reCAPTCHA less necessary, meaning that you don’t have to trigger it as aggressively, but it also has the added benefit of ensuring that legitimate messages look nicer when I read them.

Spambots can submit bare URLs to get around this, but they generally don’t because it would make their SEO-spamming less effective on sites which don’t block URL markup and my site is nowhere near important enough to get a purpose-built spambot. (And, even if it did, I’d want to keep the check to correct legitimate users’ misconceptions about what markup will actually get interpreted when I see their message.)

2. Detect URLs

A tiny fraction of the spambots I see do submit bare URLs, and we don’t want a solution which will become ineffective if applied broadly enough for spammers to adapt, so the next step is to handle the grey areas… the stuff that has legitimate uses, but also spammy ones.

The simplest way to handle this is to match on a string of text that’s essential for any sort of auto-hyperlinking to function, and then trigger stronger scrutiny (eg. reCAPTCHA) as a result.

For this, I use a regular expression. I use something like (http|ftp)s?:// because my regex is shared with other functionality, but a simple string match on :// would probably do the trick while also catching “let the human change it back” obfuscation attempts like hxxp:// in spam meant only to be read by humans.

I haven’t encountered any spam which uses URLs without the scheme portion but, if you want to guard against auto-hyperlinkable URLs of that form, also check for www.
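Put together, the tripwire can be sketched as a one-line grep (the needs_captcha name is mine, and a match should escalate to stronger scrutiny rather than reject outright):

```shell
# Match "://" anywhere (which also catches hxxp:// obfuscation) or a
# scheme-less URL starting with "www." at the start of a word.
needs_captcha() {
    printf '%s\n' "$1" | grep -Eq '://|(^|[[:space:]])www\.'
}

needs_captcha 'check hxxp://evil.example/ out' && echo "escalate to CAPTCHA"
```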

3. Do some simple sanity checks on the text

Spambots tend to be written very shoddily, so they submit some stuff so broken it’s funny at times. (One bot tried to submit the un-rendered contents of the template it was supposed to use to generate spam messages.)

A few times a year, I would get one such submission which was clearly a variation on common SEO-spam I was already blocking… but it had no URLs in it… just the placeholder text meant to pad out the message.

I decided to block that by adding the following check, which takes maybe three or four lines of code:

  1. Split the message up by whitespace (explode in PHP, split in Python or JavaScript, etc.)
  2. If the splitting function doesn’t support collapsing heterogeneous runs of whitespace characters (*cough*JavaScript*cough*), ignore any empty/whitespace-only “words”.
  3. Count up the words which do and don’t contain URLs (:// or whatever)
  4. If there are fewer than some minimum number of non-URL words, or the percentage of non-URL words relative to URLs is too low, reject the message with something like “I don’t like walls of URLs. Please add some text explaining what they are and why you’re sending them to me.”
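The four steps above can be sketched as one small shell function. The thresholds (a three-word minimum of non-URL words, and no more URLs than prose) are illustrative picks of mine, not sacred values:

```shell
looks_like_url_wall() {
    local total=0 urls=0 word
    set -f  # keep the unquoted expansion from globbing against the filesystem
    for word in $1; do  # unquoted on purpose: splits and collapses whitespace runs
        total=$((total + 1))
        case "$word" in
            *://*) urls=$((urls + 1)) ;;
        esac
    done
    set +f
    # Too few non-URL words, or more URLs than prose? Reject.
    [ $((total - urls)) -lt 3 ] || [ "$urls" -gt $((total - urls)) ]
}

if looks_like_url_wall 'http://a.example/ http://b.example/ wow'; then
    echo "Please add some text explaining what those URLs are."
fi
```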

Admittedly, some bots use blocks of text stolen from random blogs as padding, which will pass this test, but the point is to whittle away the lazier ones. Also, it can’t hurt, because you’re guarding against stuff you wouldn’t want from a human either:

  1. There’s a minimum length below which a message probably isn’t worth the effort to read. (For ongoing conversations, this will be low, because you want to block things like “+1” and “first” but allow things like “Looks good to me” but, for forms that only handle the initial message, like e-mail forms or the “new topic” form on a forum, the minimum can be higher. I advise “at least three words” as the limit for the ongoing case because “subject verb object”.)
  2. A human can easily pad out a too-short message and re-submit, but a bot won’t know what to do.
  3. It’s rude to send text that’s so URL-heavy that you’re not even giving each URL a friendly title, regardless of whether it’s a bot or a human submitting them.

WebAIM also suggested checking whether fields which shouldn’t be the same contain identical data. I don’t know if spambots which do that to unrecognized fields are still around, but I don’t see how it could hurt… just be careful to avoid the particular firstname/lastname example they gave, where sheer probability suggests that you’ll encounter someone with a name like “James James” or “Anthony Anthony” eventually. If nothing else, maybe it’ll catch lazy humans trying to fill in fake account details.

(Note that all of these sanity checks are structural. We don’t want to resort to a blacklist.)

4. Add a Honeypot

Bots like to fill out form fields. It minimizes the chance that the submission will get blocked because one of the fields is required. This is something else we can exploit.

The trick is simple. Make a field that is as attractive to the bot as possible, then tell the humans not to fill it out in natural language which the bot can’t parse. The things to keep in mind are:

  1. Don’t hide your honeypot field from humans using display: none in your CSS. Bots are getting good at parsing CSS.

    Instead, push it off the left edge of the viewport using position: absolute; so the bot has to assume that, by filling it out, it’s taking a shortcut around clicking through some kind of single-page wizard.

    (Under that rationale, you could also try hiding it using JavaScript. The important thing is to recognize that good spambots are as smart as screen readers for the blind… they just can’t understand natural language like the human behind the screen reader can.)
  2. Name your honeypot field something attractive, like url or phone or password. (url is a good one for e-mail contact forms, because you’re unlikely to need an actual URL field and that’s what WordPress’s blog comment form uses.)
  3. Set autocomplete="off" on the field so the browser won’t accidentally cause legitimate users to fail the test.
  4. Set tabindex="-1" or, if spambots start to get wise to that, explicitly put it after everything else in the tabbing order including the submit button. That way, if it becomes visible (eg. you’re hiding it using JavaScript and JavaScript is disabled) or the user’s screen reader allows them to get into it despite it being hidden, it won’t interfere with filling out the form.
  5. Use a <label for="name_of_the_field"> to provide the message about not filling it in so that assistive technologies can reliably present the message to the human.
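Putting those five guidelines together, a honeypot field might look something like this. The class name, off-screen offset, and label wording are all illustrative choices, not a canonical recipe:

```html
<!-- Pushed off the left edge rather than display: none, per point 1 -->
<style>
  .contact-url { position: absolute; left: -9999px; }
</style>

<p class="contact-url">
  <!-- Label tied to the field so assistive tech reads the warning (point 5) -->
  <label for="url">Leave this field empty if you are human:</label>
  <!-- "url" is an attractive name (point 2); autocomplete off (point 3);
       tabindex -1 keeps it out of the tabbing order (point 4) -->
  <input type="text" id="url" name="url" autocomplete="off" tabindex="-1">
</p>
```

Server-side, any submission where `url` is non-empty gets rejected (or silently discarded).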

Also, consider going light on the HTML5 validation in your other fields. I’ve heard people say that it helps to stop spambots, but I’m not sure how long ago that was and it’s never good to expose the rules defining valid input for a bot to learn from when you could be keeping them server-side and only explaining them to legitimate users in natural language.

I’ve seen multiple suggestions to scramble the field names in your forms, so that name="url" actually expects a valid e-mail address and so on, but this harms the maintainability of your code and could confuse form auto-fill in browsers like Chrome, so I’d only do it if necessary.

5. Do some simple sanity checks on the user agent

I haven’t needed to do this on the sites I wrote myself (the previous techniques were enough) but, if you need more (or if you’re using something PHP-based like WordPress, where you can just hook up Bad Behavior and call it a day), here are some other things that bottom-of-the-barrel spambot code might get wrong:

  1. Still using the default User-Agent string for whatever HTTP library they use. (eg. cURL, Python’s urllib, etc.)
  2. No User-Agent string.
  3. Typos in the User-Agent string (eg. whitespace present/missing in the wrong places or a typo’d browser/OS name)
  4. Claiming to be some ancient browser/OS that your site isn’t even compatible with
  5. Sending HTTP request headers that are invalid for the HTTP protocol version requested (added in a later version, only allowed in earlier versions, actually a response header, etc.)
  6. Sending the User-Agent string for a major browser but sending request headers which clearly disagree. (eg. not Accept-ing content types that the browser has had built-in support for since the stone age.)
  7. Not setting the Referer header correctly. (But be careful: extensions like uMatrix may forge this to always point to your site root to prevent tracking, so you’ll want to accept either the expected value or the values that known privacy extensions forge.)
  8. Sending request header values that aren’t allowed by the spec
  9. Sending custom headers that are only set by unwanted user agents
  10. Obvious signs of headless browsers.
  11. Adding/removing unexpected GET parameters on a POST request. (When you submit via POST, it’s still possible to pass things in via query parameters, so sanity-check them… just be careful, if you’re also verifying on the GET request which loads the form, to account for parameters other sites might add, on the off chance that you use something like Google Analytics.)
  12. Adding/removing unexpected POST parameters. (If a bot is trying to take shortcuts, you might see it missing or filling things a real user wouldn’t.)
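A few of the cheaper checks above (missing User-Agent, default library User-Agents, a “major browser” UA paired with an implausible Accept header) can be sketched in a few lines. The default-UA prefixes listed are real library defaults; the function itself is a heuristic sketch I’m inventing for illustration, not a complete implementation of the list:

```python
def suspicious_user_agent(headers):
    """Flag requests matching a few of the bottom-of-the-barrel
    spambot mistakes. `headers` is a plain dict of request headers."""
    ua = headers.get("User-Agent", "").strip()
    if not ua:
        return True  # check 2: no User-Agent at all
    # check 1: still using the HTTP library's default User-Agent
    default_prefixes = ("curl/", "Wget/", "python-requests",
                        "Python-urllib", "python-urllib", "libwww-perl")
    if ua.startswith(default_prefixes):
        return True
    # check 6: claims to be a browser but the Accept header disagrees --
    # every real browser sends an Accept header listing text/html when
    # navigating to a page
    if "Mozilla/" in ua and "text/html" not in headers.get("Accept", ""):
        return True
    return False
```

Treat a hit as one signal among several rather than an instant block; legitimate but unusual clients do exist.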

…and, of course, sanitize and validate your inputs. (eg. WebAIM points out that spambots might try e-mail header injection, which would be a sure-fire sign of a malicious actor that you can block.)

I’m reluctant to suggest rate-limiting or IP blacklisting as a general solution. Rate-limiting requests is more for protecting against scraping, and it’s easy for spammers to botnet their way around IP blacklists while leaving a minefield of blacklisted IPs for legitimate users to inherit the next time they disconnect and their ISP’s DHCP hands them a new address. (Plus, I can’t be the only person who middle-clicks one link, waits for it to load, middle-clicks 10 in rapid succession, and then reads the first while the other ten load.)

However, rate-limiting HTTP POST requests probably is a good idea. I may do a lot of things in parallel, but I’m not sure I’ve ever submitted multiple POST forms on the same site within a five-second window. Heck, even “Oops. I typo’d my search. Let’s try again.” may take longer than five seconds. (And that’s usually a GET request.)
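A per-IP POST rate limiter along those lines only needs to remember one timestamp per client. This class name and the in-memory approach are illustrative (a real deployment behind multiple workers would want shared storage like Redis):

```python
import time

class PostRateLimiter:
    """Allow at most one POST per `window_seconds` per client IP."""

    def __init__(self, window_seconds=5):
        self.window = window_seconds
        self.last_post = {}

    def allow(self, ip, now=None):
        """Return True if this POST is allowed. `now` is injectable
        for testing; it defaults to a monotonic clock."""
        now = time.monotonic() if now is None else now
        last = self.last_post.get(ip)
        # Record the attempt even when denied, so hammering the form
        # keeps the window from ever expiring
        self.last_post[ip] = now
        return last is None or (now - last) >= self.window
```

Recording denied attempts is a deliberate choice: a bot retrying in a tight loop stays locked out, while a human who waits out the window gets through.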

Speaking of finding your form, bots have to discover it somehow. While I doubt rate-limiting crawlers is going to be useful enough to be worthwhile, what I would suggest is to blacklist robots from your forms using robots.txt and then, using an identically-structured rule, also blacklist a link which immediately blacklists any IP that requests it. This will stop bots which are not only ignoring robots.txt, but using it to find forms.
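The paired rules might look like this (the paths are made up for illustration; the point is that the trap rule is indistinguishable in shape from the real one):

```
# robots.txt
# A bot mining Disallow lines for targets can't tell which of these
# is the real form and which immediately blacklists its IP.
User-agent: *
Disallow: /contact/
Disallow: /feedback/
```

Here `/contact/` would be the real form and `/feedback/` the trap whose route handler blacklists any IP that requests it.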

I’d also suggest adding a link to a “Click here to blacklist your IP address”-style page so spambots which don’t read robots.txt at all can still get caught but curious users who find the link don’t blacklist themselves by accident. (Just remember that the same guidelines apply as for the honeypot field. Don’t display: none or visibility: hidden to hide it because spambots may be wise to that. Thanks to for this idea.)

Measuring the time between loading the page and posting can also be helpful, but you have to be very careful about your assumptions. Measure how long it’ll take a user to load/reload the page (on a really fast connection with JavaScript and external resources disabled) and then paste some text they wrote previously. (eg. I tend to compose my posts in a separate text editor because I haven’t found a form recovery extension I like.)

If you decide to do that, you’ll want to make sure that the bot can’t just change the page-load timestamp. There are two ways I can see to accomplish that:

  1. If your framework supports it, regenerate the CSRF token every time the page containing the form is loaded and, when the form gets submitted, check that the token you receive was generated at least X amount of time ago. (3 seconds is a good starting value)
  2. If you can’t do that for some reason, use something like HMAC to generate a hash for the timestamp and then send both the timestamp and hash to the client in a hidden form field. Without the secret key you’re holding, the bot can’t change the timestamp without invalidating the hash.
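The second option is a few lines with a standard HMAC library. The secret-key placeholder and function names are illustrative; the structure (MAC over the timestamp, constant-time comparison on the way back in) is the important part:

```python
import hashlib
import hmac
import time

SECRET_KEY = b"replace-with-a-real-secret"  # illustrative placeholder

def sign_timestamp(ts=None):
    """Return (timestamp, MAC) to embed in two hidden form fields."""
    ts = str(int(time.time()) if ts is None else ts)
    mac = hmac.new(SECRET_KEY, ts.encode(), hashlib.sha256).hexdigest()
    return ts, mac

def verify_timestamp(ts, mac, min_age=3, now=None):
    """Check the MAC first, then check the form wasn't submitted
    suspiciously fast. `min_age` matches the 3-second starting value
    suggested above."""
    expected = hmac.new(SECRET_KEY, ts.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac, expected):
        return False  # timestamp was tampered with
    now = int(time.time()) if now is None else now
    return (now - int(ts)) >= min_age
```

Without the secret key, a bot that rewrites the timestamp field invalidates the MAC, and a bot that leaves it alone fails the minimum-age check.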

Another trick similar to a CSRF token is to serve up an image (like a tracking pixel, but served locally so it doesn’t get blocked) from a dynamic route. When the route handler gets called, have it make a note of the current CSRF token for the session. Then, when the form is submitted, and after checking that the CSRF token is present and valid, verify that the image was loaded and the CSRF token at that time matches the current CSRF token.

That’ll block any bot that tries to save time and bandwidth by not attempting to load images. It’s similar in concept to some of the JavaScript checks, but the odds that a legitimate user who disables JavaScript will also disable the loading of images are minuscule. (Thanks to Alain Tiemblo for the idea)

6. Prefer Structured Input

If you’re accepting submissions for a custom site, rather than just slapping up a basic comment form, structured input isn’t just a way to let submitters do some of the legwork for you.

Every additional field is another opportunity to trip the bot up by expecting it to auto-fill something that can’t be satisfied by randomly generated garbage or plagiarized snippets of someone else’s blog and has requirements only explained in human-readable text.

Structured input also makes your form look less like a blog comment or forum reply form, which may help to deter some smarter spambots.

7. Use Multi-Stage Submission

This one was suggested by WebAIM. The idea being that, if your form enters the submission into the database in some kind of draft form which will time out if not confirmed, and then returns a “Here’s a preview of how your submission will look. Please check it for errors” page that doesn’t contain the submitted fields but, rather, a submission ID and a “Confirm” button, the spambot may not be smart enough to complete the process.

I like this idea because it doesn’t feel like a CAPTCHA or an anti-spam measure to the end user… just a reasonable thing to ask the user to do to make life a little more convenient for whoever’s going to see what was received. (Plus, I find that having a preview separate from the editor helps me to notice my mistakes more readily.)

Human-Oriented Bot Spam

If you’ve ever actively followed a large site that uses Disqus for its comments, you’ve probably noticed that, before the moderators get to them, spam comments which slip through are trying to outwit spam filters by using look-alike characters. Unfortunately, due to limitations in how WordPress handles Unicode, I can’t show you an example of such a thing. (See here)

Now, if the spammer is still keeping the URLs in a form that can be clicked or copied and pasted, you may not need this… but if you can’t afford to require users to fill out a CAPTCHA every time they post, the Unicode people have developed what’s known as the TR39 Skeleton Algorithm for Unicode Confusables.

The basic idea is that, with the help of a big table, people can implement the algorithm for your language of choice (and have done so… usually under some variant of the name “confusables”. The PHP standard library includes one named Spoofchecker) and you can then go skeleton(string_1) == skeleton(string_2) to compare them without the obfuscation.

That said, it’s not quite that simple. The skeleton algorithm intentionally does not duplicate the process of normalizing uppercase vs. lowercase or ignoring combining characters, so you’ll need to do those first as preprocessing steps.

While I haven’t exhaustively tested it, my intuition is that this is the best way to skeletonize your text for spam detection:

  1. Normalize to NFKD and strip combining characters. (Eevee’s The Dark Corners of Unicode has a Python example and explains why you normally don’t want to do this, but the same issues apply to the TR39 skeleton algorithm itself, so it should be fine here.)
  2. Lowercase/uppercase the strings to be skeletonized (Do this after normalizing in case there exist precomposed glyphs with no alternative-case forms in the locale you’re operating under)
  3. Strip out all whitespace characters (To prevent things like “m a k e  m o n e y  a t  h o m e” and remove hidden spaces such as zero-width joiners)
  4. Run the TR39 skeleton algorithm on both strings.
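Step 4 still needs a confusables table (PHP’s Spoofchecker, or a library built from the Unicode data file), but steps 1 through 3 are plain standard-library work. Here’s a sketch of the preprocessing in Python; the function name is my own:

```python
import unicodedata

def preprocess_for_skeleton(text):
    """Steps 1-3 above: NFKD-normalize and strip combining marks,
    case-fold, then remove whitespace and invisible format characters.
    Step 4 (the TR39 skeleton itself) is NOT implemented here -- feed
    the result to a confusables library."""
    # 1. Decompose, then drop combining characters (accents etc.)
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed
                       if not unicodedata.combining(c))
    # 2. Case-fold -- after normalizing, per the note in step 2
    folded = stripped.casefold()
    # 3. Drop whitespace plus format characters (category Cf covers
    #    invisibles like the zero-width joiner, which isn't whitespace)
    return "".join(c for c in folded
                   if not c.isspace()
                   and unicodedata.category(c) != "Cf")
```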

Your strings should now be ready for use as input to whatever system you want to use to assess the probability of spam. (Check out this StackOverflow question if you want to train your own classifier and don’t have a spam corpus handy.)
