A Compromise Between Substring and Prefix Matching

A.K.A.: How to write what human intuition actually expects substring matching to be

While the changes aren’t yet ready to be pushed, I’ve been working on one of my hobby projects quite a bit over the last few days and I just thought I’d share a little something I stumbled upon while implementing a result filter box.

Systems with advanced string searching will often let you choose between prefix or substring matching, but I’ve found that both have glaring flaws when you’re implementing something like a “find as you type” launcher, where the goal is a fast match that’s “good enough”.

With substring matching, you quickly realize that computers are much better than humans at finding substrings in the darndest of places, making substring matching very counter-intuitive. (I get the impression that it has to do with humans thinking in syllables while computers don’t, so it’d be interesting to see how the effect changes in non-alphabetic writing systems, like Kanji or Hangul.)

By contrast, prefix matching is often overly specific and ill-suited to situations where many titles may begin with the same article (A, The, etc.) or the name of a series with many entries. Unfortunately, splitting off the articles, then moving them to the end, as Steam does, also has the potential to trip people up, so there’s no perfect solution.

The solution I developed, almost by accident, is essentially a half-way point between prefix matching and the full-blown keyword-based approach a search engine takes:

Use case-insensitive matching and require that substring matches begin at a word boundary.

This has the following desirable characteristics for a find-as-you-type solution:

  • It minimizes the need to press modifier keys, which require costly muscle synchronization:
    • It’s case-insensitive
    • There’s no need for users to quote literals to avoid them being reordered as would be necessary with a full-blown keyword search grammar (ie. “pirates of” won’t match “of pirates”)
  • It’s robust against variations in title formatting:
    • A search for “bri” will match both “The Bridge” and “Bridge, The” without also returning spurious results like “Abrix the robot”.
    • A search for “pir” will return “Space Quest III: The Pirates of Pestulon” without concern for how many Space Quest games sort earlier in the results, whether the title was transcribed using “3” or “III”, or “]|[“, whether the subtitle begins with “The”, or whether the separator is “: ” or ” – “.
  • It lacks the over-broadness that you find with substring matching, where “pir” will match “Drascula: The Vampire Strikes Back” and “Spirits”.

It’s also simple to implement:

  • For typical regexp searching, just prepend \b to the pattern and set the case-insensitive flag. (If your engine lacks \b, then use (^|\s) instead.)
  • For literal string matching on top of a regexp engine, just escape the pattern and follow my instructions for a regexp search.
  • For CMD.EXE-style wildcard matching, escape the pattern, then replace \? with . and \* with .* before prepending the \b.
  • For a manual implementation of literal-string matching on titles with normalized whitespace, just check whether it matches at the beginning (eg. title.lower().startswith(pattern.lower())) and then prepend a space and search within. (eg. (title.lower().index(' ' + pattern.lower())) >= 0)

UPDATE 2016-10-02: The \b word boundary token doesn’t consider parentheses to be part of a word, which I’ve found to be a confusing surprise in day-to-day use, so you’ll want to use (^|\b|\s) instead of \b. This will allow both “(Eng” and “Eng” to match “(English)” in typical usage for maximum intuitiveness.

In case you want to play around with this, here’s a quick sampling of how to regex-escape a string in various popular environments:

CC BY-SA 4.0 A Compromise Between Substring and Prefix Matching by Stephan Sokolow is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

This entry was posted in Geek Stuff. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

By submitting a comment here you grant this site a perpetual license to reproduce your words and name/web site in attribution under the same terms as the associated post.

All comments are moderated. If your comment is generic enough to apply to any post, it will be assumed to be spam. Borderline comments will have their URL field erased before being approved.