I don’t know how many people know this, but reCAPTCHA is a major pain if you’ve configured your browser to prevent Google from doing things like setting tracking cookies or fingerprinting your `<canvas>`. Sometimes, it’ll take me a minute or more before the bleeping thing lets me through.
So, for my own sites, I’m very reluctant to make people fill out CAPTCHAs. (Plus, there’s also an aspect of “Is this what we’ve been reduced to? Taking for granted that we must constantly pester legitimate users to prove that they’re human because we’re letting the bad actors set the terms of engagement?”)
Note that I will not be covering the pile of techniques that require JavaScript to implement because, as a dedicated uMatrix user, I find those to also be annoying, though nowhere near as much as reCAPTCHA.
So, let’s think about this problem for a second. What can we do to improve things by reducing the need to display reCAPTCHA?
Well, first let’s think about the types of spam we’re going to receive. I’ve noticed two types, and I’ll start by addressing the kind CAPTCHAs don’t prevent:
Human-Sent Spam
Believe it or not, several times a year I receive spam that has clearly been sent by a human, trying to promote some shady service they think I’ll want (typically SEO or paid traffic).
I tried putting up a message which clearly states that the contact form on this blog is not for this sort of message, but I still occasionally get someone who ignores it… so what more can be done?
Well, I can’t do it with my current WordPress plugin but, for my other sites, how about trying to make sure they actually read it, and making it sound scarier for them to ignore it?
The simplest way to do this is to add a checkbox that says something like “I hereby swear under penalty of perjury that this message is not intended to solicit customers for any form of commercial service” like I did for the GBIndex contact form.
Since you’re guarding against an actual human this time, using a normal browser, you don’t even need any server-side code. Just set `required="required"` in the checkbox’s markup and their browser will refuse to submit the form until they check the box, drawing their attention to it, which is exactly what we want.
Of course, you want it to be clear that it’s not toothless stock text, so there are two other things you should do:
- Don’t just copy-paste my phrasing. Identical text is only good in such a declaration if the readers associate consistency with “this has the force of law and has been tested in actual court cases” rather than “this is a stock snip of HTML from www.TopHTMLSnips.blort”
- Include a highly visible message somewhere on the page which makes it clear that, if they just blindly check the box, you’ll report whatever they’re promoting to their service providers (domain registrars, web hosts, etc.) for Terms of Service violations.
(and do follow through. For example, use the global WHOIS database to identify the domain registrar, then use the registrar’s “Report Abuse” link in their site footer or support section. Then use the registrar’s WHOIS lookup service to identify the nameserver provider and use their “Report Abuse” link. If you think the hosting may be with a shared hosting provider different from the nameserver provider, you can use techniques like doing a DNS lookup on the domain, then reverse DNS lookups on the resulting IP addresses.)
You could also put a Bayesian filter to work on your inbox, but I’m always wary of false positives and don’t want to have to sift through a spam box periodically, so I try to avoid that… and this works well enough.
OK, so, with that out of the way, let’s get to what CAPTCHAs are meant to stop…
Bot-Sent Spam
There are two kinds of bot-sent spam. Stuff meant to be read by humans, and stuff meant to be read by machines. Since some of the techniques used for preventing machine-targeted spam also help to stem the tide of stuff aimed at humans, we’ll address those first.
In both cases, you can certainly apply a Bayesian filter but, as with human-sent spam, I aim for something more deterministic.
Machine-Readable Bot Spam
Machine-readable spam is spam intended to evoke a reaction from another machine. The most typical example of this is manipulating search results by scattering links to their garbage all over the web.
The key to combating machine-readable spam is recognizing that, if the target machine can understand the important characteristics of the message, so can your spam-prevention measures.
1. Block Link Markup
The first layer of protection I like to apply is to detect disallowed markup and present a human-readable message explaining what changes must be made for the message to be accepted.
For example, in my contact forms, which are going to be rendered as plaintext e-mails, the spam that gets submitted comes from bots that mistake them for blog comment fields, and 99% of that can be killed simply by disallowing `</a>`, `[/url]`, and `[/link]` in messages, and instructing users to switch to bare URLs.
This is mainly about making the reCAPTCHA less necessary, meaning that you don’t have to trigger it as aggressively, but it also has the added benefit of ensuring that legitimate messages look nicer when I read them.
Spambots can submit bare URLs to get around this, but they generally don’t because it would make their SEO-spamming less effective on sites which don’t block URL markup and my site is nowhere near important enough to get a purpose-built spambot. (And, even if it did, I’d want to keep the check to correct legitimate users’ misconceptions about what markup will actually get interpreted when I see their message.)
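For illustration, here’s a minimal sketch of such a check in Python. The token list and the wording of the rejection message are just placeholders, not anyone’s exact production code:

```python
import re

# Markup tokens that only make sense in rendered HTML/BBCode comments, not in
# a plaintext e-mail. The exact list and wording are illustrative.
DISALLOWED_MARKUP = re.compile(r"</a>|\[/url\]|\[/link\]", re.IGNORECASE)

def check_for_link_markup(message):
    """Return a human-readable rejection message, or None if the message is fine."""
    if DISALLOWED_MARKUP.search(message):
        return ("This form sends a plaintext e-mail, so HTML/BBCode link markup "
                "won't be rendered. Please switch to bare URLs and resubmit.")
    return None
```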
2. Detect URLs
A tiny fraction of the spambots I see do submit bare URLs, and we don’t want a solution which will become ineffective if applied broadly enough for spammers to adapt, so the next step is to handle the grey areas… the stuff that has legitimate uses, but also spammy ones.
The simplest way to handle this is to match on a string of text that’s essential for any sort of auto-hyperlinking to function, and then trigger stronger scrutiny (eg. reCAPTCHA) as a result.
For this, I use a regular expression, something like `(http|ftp)s?://`, because my regex is shared with other functionality, but a simple string match on `://` would probably do the trick while also catching “let the human change it back” obfuscation attempts like `hxxp://` in spam meant only to be read by humans.
I haven’t encountered any spam which uses URLs without the scheme portion but, if you want to guard against auto-hyperlinkable URLs of that form, also check for `www.`
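A rough Python sketch of that check; the pattern and the helper name are illustrative, not a fixed API:

```python
import re

# The (http|ftp)s?:// pattern mirrors the regex mentioned above; the bare "://"
# and "www." substring checks are the simpler fallbacks discussed in the text.
URL_RE = re.compile(r"(?:http|ftp)s?://", re.IGNORECASE)

def message_needs_extra_scrutiny(message):
    """True if the message contains something that could auto-hyperlink."""
    lowered = message.lower()
    return bool(URL_RE.search(message)) or "://" in lowered or "www." in lowered
```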
3. Do some simple sanity checks on the text
Spambots tend to be written very shoddily, so they submit some stuff so broken it’s funny at times. (One bot tried to submit the un-rendered contents of the template it was supposed to use to generate spam messages.)
A few times a year, I would get one such submission which was clearly a variation on common SEO-spam I was already blocking… but it had no URLs in it… just the placeholder text meant to pad out the message.
I decided to block that by adding the following check, which takes maybe three or four lines of code (a sketch follows the list):
- Split the message up by whitespace (`explode` in PHP, `split` in Python or JavaScript, etc.)
- If the splitting function doesn’t support collapsing heterogeneous runs of whitespace characters (*cough*JavaScript*cough*), ignore any empty/whitespace-only “words”.
- Count up the words which do and don’t contain URLs (`://` or whatever)
- If there are fewer than some minimum number of non-URL words, or the percentage of non-URL words relative to URLs is too low, reject the message with something like “I don’t like walls of URLs. Please add some text explaining what they are and why you’re sending them to me.”
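Here’s a rough Python sketch of that check. The thresholds and the `www.` handling are illustrative assumptions:

```python
MIN_NON_URL_WORDS = 3    # "subject verb object"; thresholds are illustrative
MIN_NON_URL_RATIO = 2.0  # require at least two non-URL words per URL

def looks_like_a_wall_of_urls(message):
    words = message.split()  # split() with no argument collapses whitespace runs
    url_words = [w for w in words if "://" in w or w.lower().startswith("www.")]
    non_url_words = len(words) - len(url_words)
    if non_url_words < MIN_NON_URL_WORDS:
        return True
    return bool(url_words) and (non_url_words / len(url_words)) < MIN_NON_URL_RATIO
```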
Admittedly, some bots use blocks of text stolen from random blogs as padding, which will pass this test, but the point is to whittle away the lazier ones. Also, it can’t hurt, because you’re guarding against stuff you wouldn’t want from a human either:
- There’s a minimum length below which a message probably isn’t worth the effort to read. (For ongoing conversations, this will be low, because you want to block things like “+1” and “first” but allow things like “Looks good to me” but, for forms that only handle the initial message, like e-mail forms or the “new topic” form on a forum, the minimum can be higher. I advise “at least three words” as the limit for the ongoing case because “subject verb object”.)
- A human can easily pad out a too-short message and re-submit, but a bot won’t know what to do.
- It’s rude to send text that’s so URL-heavy that you’re not even giving each URL a friendly title, regardless of whether it’s a bot or a human submitting them.
WebAIM also suggested checking whether fields which shouldn’t be the same contain identical data. I don’t know if spambots which do that to unrecognized fields are still around, but I don’t see how it could hurt… just be careful to avoid the particular firstname/lastname example he gave, where sheer probability suggests that you’ll encounter someone with a name like “James James” or “Anthony Anthony” eventually. If nothing else, maybe it’ll catch lazy humans trying to fill in fake account details.
(Note that all of these sanity checks are structural. We don’t want to resort to a blacklist.)
4. Add a Honeypot
Bots like to fill out form fields. It minimizes the chance that the submission will get blocked because one of the fields is required. This is something else we can exploit.
The trick is simple. Make a field that is as attractive to the bot as possible, then tell the humans not to fill it out in natural language which the bot can’t parse. The things to keep in mind are:
- Don’t hide your honeypot field from humans using `display: none` in your CSS. Bots are getting good at parsing CSS. Instead, push it off the left edge of the viewport using `position: absolute;` so the bot has to assume that, by filling it out, it’s taking a shortcut around clicking through some kind of single-page wizard. (Under that rationale, you could also try hiding it using JavaScript. The important thing is to recognize that good spambots are as smart as screen readers for the blind… they just can’t understand natural language like the human behind the screen reader can.)
- Name your honeypot field something attractive, like `url` or `phone` or `password`. (`url` is a good one for e-mail contact forms, because you’re unlikely to need an actual URL field and that’s what WordPress’s blog comment form uses.)
- Set `autocomplete="off"` on the field so the browser won’t accidentally cause legitimate users to fail the test.
- Set `tabindex="-1"` or, if spambots start to get wise to that, explicitly put it after everything else in the tabbing order, including the submit button. That way, if it becomes visible (eg. you’re hiding it using JavaScript and JavaScript is disabled) or the user’s screen reader allows them to get into it despite it being hidden, it won’t interfere with filling out the form.
- Use a `<label for="name_of_the_field">` to provide the message about not filling it in so that assistive technologies can reliably present the message to the human.
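The server-side half is trivial; here’s a minimal sketch, assuming the honeypot field is named `url` as suggested above and the submitted form data arrives as a dict-like object:

```python
# Reject any submission where the honeypot field was filled in.
HONEYPOT_FIELD = "url"  # assumed field name from the list above

def failed_honeypot(form):
    """True if something filled in the field humans were told to leave empty."""
    return bool(form.get(HONEYPOT_FIELD, "").strip())
```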
Also, consider going light on the HTML5 validation in your other fields. I’ve heard people say that it helps to stop spambots, but I’m not sure how long ago that was and it’s never good to expose the rules defining valid input for a bot to learn from when you could be keeping them server-side and only explaining them to legitimate users in natural language.
I’ve seen multiple suggestions to scramble up the field names for your form fields, so `name="url"` actually expects a valid e-mail and so on, but this harms maintainability for your code and could scramble up the form auto-fill in browsers like Chrome, so I’d only do it if necessary.
5. Do some simple sanity checks on the user agent
I haven’t needed to do this on the sites I wrote myself (The previous techniques were enough) but, if you need more (or if you’re using something PHP-based like WordPress where you can just hook up Bad Behaviour and call it a day), here are some other things that bottom-of-the-barrel spambot code might get wrong:
- Still using the default User-Agent string for whatever HTTP library they use. (eg. cURL, Python’s urllib, etc.)
- No User-Agent string.
- Typos in the User-Agent string (eg. whitespace present/missing in the wrong places or a typo’d browser/OS name)
- Claiming to be some ancient browser/OS that your site isn’t even compatible with
- Sending HTTP request headers that are invalid for the HTTP protocol version requested (added in a later version, only allowed in earlier versions, actually a response header, etc.)
- Sending the User-Agent string for a major browser but sending request headers which clearly disagree. (eg. not `Accept`-ing content types that the browser has had built-in support for since the stone age.)
- Not setting the `Referer` header correctly (but be careful. Extensions like uMatrix may forge this to always point to your site root to prevent tracking, so you want to accept either the expected value or a whitelist of values that privacy extensions are known to forge it to.)
- Sending request header values that aren’t allowed by the spec
- Sending custom headers that are only set by unwanted user agents
- Obvious signs of headless browsers.
- Adding/removing unexpected `GET` parameters on a `POST` request. (When you submit via `POST`, it’s still possible to pass things in via query parameters, so sanity-check that… just be careful that, if you’re verifying on the `GET` request which loads the form, you account for things other sites might add on the off chance that you use something like Google Analytics.)
- Adding/removing unexpected `POST` parameters. (If a bot is trying to take shortcuts, you might see it missing fields or filling in things a real user wouldn’t.)
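A couple of the cheaper checks from that list, sketched in Python. The list of library User-Agent prefixes is illustrative, and this assumes your framework exposes request headers as a case-insensitive dict-like object:

```python
# Substrings/prefixes here are examples of default HTTP-library identifiers.
DEFAULT_LIBRARY_UAS = ("curl/", "python-urllib", "python-requests", "libwww-perl", "wget/")

def user_agent_looks_suspicious(headers):
    ua = headers.get("User-Agent", "").strip()
    if not ua:
        return True  # no User-Agent string at all
    lowered = ua.lower()
    if lowered.startswith(DEFAULT_LIBRARY_UAS):
        return True  # still using the HTTP library's default identifier
    # A "major browser" User-Agent that sends no Accept header is also suspect.
    if lowered.startswith("mozilla/") and not headers.get("Accept"):
        return True
    return False
```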
…and, of course, sanitize and validate your inputs. (eg. WebAIM points out that spambots might try e-mail header injection, which would be a sure-fire sign of a malicious actor that you can block.)
I’m reluctant to suggest rate-limiting or IP blacklisting as a general solution, since rate-limiting requests is more for protecting against scraping and it’s easy for spammers to botnet their way around IP blacklists while leaving a minefield of blacklisted IPs for legitimate users to receive from their ISP the next time they disconnect and DHCP gives them a new IP address. (Plus, I can’t be the only person who middle-clicks one link, waits for it to load, middle-clicks 10 in rapid succession, and then reads the first while the other ten load.)
However, rate-limiting HTTP `POST` requests probably is a good idea. I may do a lot of things in parallel, but I’m not sure I’ve ever submitted multiple `POST` forms on the same site within a five-second window. Heck, even “Oops. I typo’d my search. Let’s try again.” may take longer than five seconds. (And that’s usually a `GET` request.)
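A minimal sketch of what that might look like, assuming a single-process server; a real deployment would want shared storage like Redis or memcached rather than an in-memory dict:

```python
import time

MIN_SECONDS_BETWEEN_POSTS = 5
_last_post_at = {}  # single-process only; use shared storage in production

def post_is_too_soon(ip_address):
    """True if this IP already submitted a POST within the last few seconds."""
    now = time.monotonic()
    last = _last_post_at.get(ip_address)
    _last_post_at[ip_address] = now
    return last is not None and (now - last) < MIN_SECONDS_BETWEEN_POSTS
```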
Speaking of crawling, bots have to find your form somehow. While I doubt rate-limiting crawlers is going to be useful enough to be worthwhile, what I would suggest is to disallow robots from your forms using `robots.txt` and then, using an identically-structured rule, also disallow a link which immediately blacklists any IP which requests it. This will stop bots which are not only ignoring `robots.txt`, but using it to find forms.
I’d also suggest adding a link to a “Click here to blacklist your IP address”-style page so spambots which don’t read `robots.txt` at all can still get caught, but curious users who find the link don’t blacklist themselves by accident. (Just remember that the same guidelines apply as for the honeypot field. Don’t use `display: none` or `visibility: hidden` to hide it, because spambots may be wise to that. Thanks to fleiner.com for this idea.)
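Here’s a rough sketch of the trap URL, using Flask purely as an example framework. The path is arbitrary, and `add_to_blacklist()` is a hypothetical helper you’d write yourself (e.g. inserting the IP into a table that your form handler checks before accepting submissions):

```python
from flask import Flask, abort, request

app = Flask(__name__)

@app.route("/secret-member-area")  # also Disallow this path in robots.txt
def blacklist_trap():
    # Behind a reverse proxy you'd need the real client IP, not remote_addr.
    add_to_blacklist(request.remote_addr)  # hypothetical helper
    abort(403)
```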
Measuring the time between loading the page and posting can also be helpful, but you have to be very careful about your assumptions. Measure how long it’ll take a user to load/reload the page (on a really fast connection with JavaScript and external resources disabled) and then paste some text they wrote previously. (eg. I tend to compose my posts in a separate text editor because I haven’t found a form recovery extension I like.)
If you decide to do that, you’ll want to make sure that the bot can’t just change the page-load timestamp. There are two ways I can see to accomplish that:
- If your framework supports it, regenerate the CSRF token every time the page containing the form is loaded and, when the form gets submitted, check that the token you receive was generated at least X amount of time ago. (3 seconds is a good starting value)
- If you can’t do that for some reason, use something like HMAC to generate a hash for the timestamp and then send both the timestamp and hash to the client in a hidden form field. Without the secret key you’re holding, the bot can’t change the timestamp without invalidating the hash. (A sketch follows the list.)
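A minimal sketch of the HMAC approach using Python’s standard library `hmac` module; the key and the minimum age are placeholders:

```python
import hashlib
import hmac
import time

SECRET_KEY = b"change-me"  # placeholder; keep the real key server-side only

def make_timestamp_token():
    """Generate (timestamp, digest) to embed in hidden form fields."""
    timestamp = str(int(time.time()))
    digest = hmac.new(SECRET_KEY, timestamp.encode(), hashlib.sha256).hexdigest()
    return timestamp, digest

def timestamp_is_plausible(timestamp, digest, min_age=3):
    """True if the timestamp is untampered and at least min_age seconds old."""
    expected = hmac.new(SECRET_KEY, timestamp.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, digest):
        return False  # the timestamp was changed (or the hash was forged)
    return (time.time() - int(timestamp)) >= min_age
```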
Another trick similar to a CSRF token is to serve up an image (like a tracking pixel, but served locally so it doesn’t get blocked) from a dynamic route. When the route handler gets called, have it make a note of the current CSRF token for the session. Then, when the form is submitted, and after checking that the CSRF token is present and valid, verify that the image was loaded and the CSRF token at that time matches the current CSRF token.
That’ll block any bot that tries to save time and bandwidth by not attempting to load images. It’s similar in concept to some of the JavaScript checks, but the odds that a legitimate user who disables JavaScript will also disable the loading of images are minuscule. (Thanks to Alain Tiemblo for the idea)
6. Prefer Structured Input
If you’re accepting submissions for a custom site, rather than just slapping up a basic comment form, structured input isn’t just a way to let submitters do some of the legwork for you.
Every additional field is another opportunity to trip the bot up by expecting it to auto-fill something that can’t be satisfied by randomly generated garbage or plagiarized snippets of someone else’s blog and has requirements only explained in human-readable text.
Structured input also makes your form look less like a blog comment or forum reply form, which may help to deter some smarter spambots.
7. Use Multi-Stage Submission
This one was suggested by WebAIM. The idea is that, if your form enters the submission into the database in some kind of draft form which will time out if not confirmed, and then returns a “Here’s a preview of how your submission will look. Please check it for errors” page that doesn’t contain the submitted fields but, rather, a submission ID and a “Confirm” button, the spambot may not be smart enough to complete the process.
I like this idea because it doesn’t feel like a CAPTCHA or an anti-spam measure to the end user… just a reasonable thing to ask the user to do to make life a little more convenient for whoever’s going to see what was received. (Plus, I find that having a preview separate from the editor helps me to notice my mistakes more readily.)
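A rough sketch of the draft-then-confirm flow, using an in-memory store for brevity; a real site would persist drafts in its database and render the preview page around them:

```python
import time
import uuid

DRAFT_TIMEOUT = 15 * 60  # seconds before an unconfirmed draft is discarded
_drafts = {}  # in-memory for brevity; persist drafts in your database in practice

def save_draft(fields):
    """Store the submission as a draft and return the ID to show on the preview page."""
    draft_id = uuid.uuid4().hex
    _drafts[draft_id] = (time.time(), fields)
    return draft_id

def confirm_draft(draft_id):
    """Return the submitted fields if the draft exists and hasn't timed out, else None."""
    created_at, fields = _drafts.pop(draft_id, (None, None))
    if created_at is None or (time.time() - created_at) > DRAFT_TIMEOUT:
        return None
    return fields
```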
Human-Oriented Bot Spam
If you’ve ever actively followed a large site that uses Disqus for its comments, you’ve probably noticed that, before the moderators get to them, spam comments which slip through are trying to outwit spam filters by using look-alike characters. Unfortunately, due to limitations in how WordPress handles Unicode, I can’t show you an example of such a thing. (See here)
Now, if the spammer is still keeping the URLs in a form that can be clicked or copied and pasted, you may not need this… but if you can’t afford to require users to fill out a CAPTCHA every time they post, the Unicode people have developed what’s known as the TR39 Skeleton Algorithm for Unicode Confusables.
The basic idea is that, with the help of a big table, people can implement the algorithm for your language of choice (and have done so… usually under some variant of the name “confusables”. The PHP standard library includes one named Spoofchecker) and you can then go `skeleton(string_1) == skeleton(string_2)` to compare them without the obfuscation.
That said, it’s not quite that simple. The skeleton algorithm intentionally does not duplicate the process of normalizing uppercase vs. lowercase or ignoring combining characters, so you’ll need to do those first as preprocessing steps.
While I haven’t exhaustively tested it, my intuition is that this is the best way to skeletonize your text for spam detection:
- Normalize to NFKD and strip combining characters. (Eevee’s The Dark Corners of Unicode has a Python example and explains why you normally don’t want to do this, but the same issues apply to the TR39 skeleton algorithm itself, so it should be fine here.)
- Lowercase/uppercase the strings to be skeletonized (Do this after normalizing in case there exist precomposed glyphs with no alternative-case forms in the locale you’re operating under)
- Strip out all whitespace characters (To prevent things like “m a k e m o n e y a t h o m e” and remove hidden spaces such as zero-width joiners)
- Run the TR39 skeleton algorithm on both strings.
Your strings should now be ready for use as input to whatever system you want to use to assess the probability of spam. (Check out this StackOverflow question if you want to train your own classifier and don’t have a spam corpus handy.)
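Here’s a rough Python sketch of the preprocessing steps; the final `skeleton()` call is whatever confusables implementation you pick, so it’s left hypothetical:

```python
import unicodedata

def preprocess_for_skeleton(text):
    """Steps 1-3: NFKD + strip combining marks, case-fold, drop whitespace/invisibles."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    folded = no_marks.casefold()
    # isspace() misses zero-width characters (category Cf), so drop those too.
    return "".join(
        ch for ch in folded
        if not ch.isspace() and unicodedata.category(ch) != "Cf"
    )

# Step 4: feed the result to your confusables/skeleton implementation, e.g.
# skeleton(preprocess_for_skeleton(a)) == skeleton(preprocess_for_skeleton(b))
```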