Splitting CamelCase without Regex lookahead/lookbehind

I’ve been working to port the brains of my game launcher experiment from Python to Rust for easier refactoring and to eventually offer language bindings other than Python itself and, at the moment, I’m working on the “filename to title” heuristic, because that’s got the best unit test suite.

Surprise, surprise. The biggest time-sink has been redesigning things where the Python version uses regular expressions with lookbehind and lookahead expressions.

Well, the CamelCase-processing regular expression faked me out. If you search for answers, it seems every language is using some variation of this regex:

r'((?<=[a-z])[A-Z]|(?<!\A)[A-Z](?=[a-z]))'

…and, since Rust’s Regex crate uses a DFA to avoid pathological performance, this wasn’t an option.

I wasted an afternoon trying to implement such a parser manually before it occurred to me that you don’t actually need lookahead/lookbehind to parse camelcase!

It seems that someone originally wrote that regex with the goal of getting accurate string indexes for the starts of words, then everyone else copy-pasted and translated it without recognizing that it was stricter than necessary.

UPDATE: There is one situation where the lack of lookahead has to be worked around which I somehow overlooked… any situation where you need to insert a space before and after the same character. (ie. words like “I” and “A” in the middle of a phrase.) The solution there is just to run the regex twice.

So, if you just want to insert spaces (without duplicating ones that are already there), and you want to enhance the algorithm to handle numbers and ampersands reasonably and you want it to work in Rust’s regex engine, here’s how you do it:

r"([a-z&])([&A-Z0-9])|([^ ])([A-Z][a-z])|([0-9])([a-z])"

…and then use this as your replacement string:

"${1}${3}${5} ${2}${4}${6}"

It passes every test in my unit test suite for CamelCase splitting and improves the accuracy score on the integration test corpus for the filename_to_name function which relies on it.

UPDATE #2: I got to playing with the Regex crate’s support for unicode character classes and I have two things to report:

First, you actually do need a fourth pair of capture groups to split “A&b” because of the “ampersand, followed by lowercase” situation.

Second, for anyone who wants it, here’s a fully commented Rust regexp which will process CamelCase written using any mix of what the Unicode database considers to be letters (lowercase, uppercase, or titlecase), numbers (decimal digits, letters, or other), and the four different kinds of ampersands I was able to find:

r"(?x)
 # == Ampersand (definitely end) followed by anything not already whitespace ==
 # ('Ampersand' or 'Small Ampersand', or 'Fullwidth Ampersand' or...
 # 'Heavy Ampersand Ornament')
 ([\x{26}\x{FE60}\x{FF06}\x{1F674}])
 # (Something not 'Separator, Space')
 (\P{Zs})

 # == OR ==
 |

 # == Lower/titlecase (possible end) followed by upper/titlecase/number (possible start) ==
 # ('Letter, Lowercase' or 'Letter, Titlecase', 'Ampersand' or 'Small Ampersand', or...
 # 'Fullwidth Ampersand' or 'Heavy Ampersand Ornament')
 ([\p{Ll}\p{Lt}])
 # ('Letter, Uppercase' or 'Number, Decimal Digit' or 'Number, Letter' or...
 # 'Number, Other' or 'Ampersand' or 'Small Ampersand' or...
 # 'Fullwidth Ampersand' or 'Heavy Ampersand Ornament')
 ([\p{Lu}\p{Lt}\p{Nd}\p{Nl}\p{No}])

 # == OR ==
 |

 # == Number followed by an un-capitalized word ==
 # ('Number, Decimal Digit' or 'Number, Letter' or 'Number, Other')
 ([\p{Nd}\p{Nl}\p{No}])
 # ('Letter, Lowercase')
 (\p{Ll})

 # == OR ==
 |

 # == Anything not whitespace, followed by ampersand or unambiguous beginnings of word ==
 # (Something not 'Separator, Space')
 (\P{Zs})
 # ('Letter, Titlecase' or ['Letter, Uppercase' followed by 'Letter, Lowercase'] or...
 # 'Ampersand' or 'Small Ampersand' or 'Fullwidth Ampersand' or ...
 # 'Heavy Ampersand Ornament')
 (\p{Lt} | \p{Lu}\p{Ll} | [\x{26}\x{FE60}\x{FF06}\x{1F674}])
 "

…and…

"${1}${3}${5}${7} ${2}${4}${6}${8}"

Before you get scared, it’s actually just the Unicode version of the tiny regexp with a ton of explanatory comments and a little more ampersand matching. It’s very simple once you throw out the comments and understand the structure.

The regular expression matches any of four situations where a space should be inserted:

  • An ampersand followed by anything that isn’t whitespace (this is the one I split out)
  • A lowercase or titlecase (upper+lower in one character) letter followed by an uppercase/titlecase letter or a number.
  • A number followed by an un-capitalized letter.
  • Anything not whitespace, followed by an ampersand, titlecase letter, or <UPPER><lower> pair.

Once you ignore the comments, most of the verbosity comes from having to combine multiple unicode character literals or classes to say things like…

  • “any ampersand” ([\x{26}\x{FE60}\x{FF06}\x{1F674}])
  • “any type of number” ([\p{Nd}\p{Nl}\p{No}])

The advantage is that, because it delegates the lower-level details to the Unicode lookup tables for everything but the ampersands, any failure should result from the language being so alien that the concept of CamelCase itself is difficult to apply to it.

UPDATE #3: Replace chained OR tokens with compound character classes where possible for minor cache pressure savings in the regexp engine.

It’s still a heavy regexp (I was chatting with burntsushi on IRC when the idea occurred to me and, if anyone should have the expertise to judge that, it’s the author of the Regex crate), but it meets my needs, it’s easy to tweak as I further refine my heuristic guesser, and if I ever need something more performant, writing this has taught me enough about the problem that I should be able to write a manual solution.

CC BY-SA 4.0 Splitting CamelCase without Regex lookahead/lookbehind by Stephan Sokolow is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

This entry was posted in Geek Stuff. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

By submitting a comment here you grant this site a perpetual license to reproduce your words and name/web site in attribution under the same terms as the associated post.

All comments are moderated. If your comment is generic enough to apply to any post, it will be assumed to be spam. Borderline comments will have their URL field erased before being approved.