A Better Linkifying Regex

From time to time, I run across situations where the linkifying Greasemonkey script I use mistakenly includes a closing parenthesis in what it considers to be a URL.

I can’t remember a single situation where I needed to linkify a URL with nested unescaped parentheses, but URLs inside parentheses have bitten me repeatedly. So I decided to solve the problem in a way that’ll work with any regex grammar.

const urlRegex = /\b((ht|f)tps?:\/\/[^\s+\"\<\>()]+(\([^\s+\"\<\>()]*\)[^\s+\"\<\>()]*)*)/ig;

Basically, it matches:

  1. http, https, ftp, or ftps followed by ://
  2. an alternating sequence of “stuff” and balanced pairs of parentheses containing “stuff”

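…where “stuff” refers to a sequence of zero or more non-whitespace, non-parenthesis characters (and, in this linkify.user.js version, non-plus-sign, non-angle-bracket, and non-double-quote too). The one exception is the first run of “stuff” right after the ://, which has to contain at least one character.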
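To make the two parts easier to see, here is a minimal sketch that assembles the same pattern from named pieces. It is a Python translation of the JavaScript regex above, not the user script itself; the names `STUFF`, `URL_RE`, `find_urls`, and the sample sentence are made up for illustration.

```python
import re

# "Stuff": one character that is not whitespace, a plus sign, a double
# quote, an angle bracket, or a parenthesis (same class as the JS regex).
STUFF = r'[^\s+"<>()]'

URL_RE = re.compile(
    rf'\b((?:ht|f)tps?://'             # 1. http, https, ftp, or ftps, then ://
    rf'{STUFF}+'                       # 2. a first, non-empty run of "stuff"
    rf'(?:\({STUFF}*\){STUFF}*)*)',    #    then "(stuff)stuff", repeated
    re.IGNORECASE,
)


def find_urls(text):
    """Return every URL the pattern finds in text (illustrative helper)."""
    return [match.group(1) for match in URL_RE.finditer(text)]


# Illustrative input: a URL with its own parentheses, wrapped in parentheses.
sample = 'See (http://en.wikipedia.org/wiki/Regex_(disambiguation)) for details.'
print(find_urls(sample))
# ['http://en.wikipedia.org/wiki/Regex_(disambiguation)']
```

The balanced pair inside the URL is kept, while the closing parenthesis that belongs to the surrounding sentence is left out.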

Embarrassingly, aside from two corrections and a few extra characters in the blacklists that I kept from the original linkify.user.js regex, this is a direct translation of something I wrote for http://ssokolow.com/scripts/ years ago… I’d just never remembered the problem in a situation where I could spare the time and willpower to do something about it.

Here’s the corrected Python original.

import re

hyperlinkable_url_re = re.compile(r"""((?:ht|f)tps?://[^\s()]+(?:\([^\s()]*\)[^\s()]*)*)""", re.IGNORECASE | re.UNICODE)

The corrections made were:

  1. Allow the pairs of literal parentheses to be empty
  2. Move a grouping parenthesis so that a(b)c(d)e will be matched as readily as a(b)(c)d.
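The effect of the two corrections can be sketched by comparing the corrected pattern with a guess at what it replaced. The `uncorrected` pattern below is purely hypothetical: the pre-correction regex isn't shown anywhere in this post, so this is just one pattern that would exhibit both problems. The example.com URLs and the helper name `first_match` are also made up.

```python
import re

# The corrected pattern from above (re.UNICODE is already the default for
# str patterns in Python 3, so only IGNORECASE is passed here).
corrected = re.compile(
    r"(?:ht|f)tps?://[^\s()]+(?:\([^\s()]*\)[^\s()]*)*", re.IGNORECASE)

# HYPOTHETICAL reconstruction of the uncorrected version: the parentheses
# must be non-empty, and the trailing "stuff" sits outside the repeated group.
uncorrected = re.compile(
    r"(?:ht|f)tps?://[^\s()]+(?:\([^\s()]+\))*[^\s()]*", re.IGNORECASE)


def first_match(pattern, text):
    """Return the text of the first match of pattern in text."""
    return pattern.search(text).group(0)


for url in ("http://example.com/a()",
            "http://example.com/a(b)c(d)e",
            "http://example.com/a(b)(c)d"):
    print(f"{url}\n  uncorrected: {first_match(uncorrected, url)}"
          f"\n  corrected:   {first_match(corrected, url)}")
```

Under that assumption, the uncorrected pattern stops at `http://example.com/a` for the empty pair and at `http://example.com/a(b)c` for the second URL, while the corrected one matches all three URLs in full.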

Markdown source code works especially well to demonstrate the difference.

Naive Linkifying Regex: [FreeDOS](http://freedos.org). (the link runs on through the closing parenthesis)
My Linkifying Regex: [FreeDOS](http://freedos.org). (the link stops at http://freedos.org)
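The same comparison can be reproduced as a rough sketch in Python. `NAIVE_RE` is an assumption standing in for "any regex that grabs everything up to the next whitespace", not the actual linkify.user.js pattern, and `linkify` is a hypothetical helper that only wraps matches in anchor tags without escaping the surrounding text.

```python
import html
import re

# ASSUMED naive pattern: scheme, then everything up to the next whitespace.
NAIVE_RE = re.compile(r'(?:ht|f)tps?://\S+', re.IGNORECASE)

# The Python pattern from earlier in this post.
BETTER_RE = re.compile(
    r'(?:ht|f)tps?://[^\s()]+(?:\([^\s()]*\)[^\s()]*)*', re.IGNORECASE)


def linkify(text, pattern):
    """Wrap every match of pattern in an <a> tag (illustrative only)."""
    def wrap(match):
        url = html.escape(match.group(0), quote=True)
        return f'<a href="{url}">{url}</a>'
    return pattern.sub(wrap, text)


source = '[FreeDOS](http://freedos.org).'
print(linkify(source, NAIVE_RE))
# [FreeDOS](<a href="http://freedos.org).">http://freedos.org).</a>
print(linkify(source, BETTER_RE))
# [FreeDOS](<a href="http://freedos.org">http://freedos.org</a>).
```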

Theoretically, some regex engines offer extensions powerful enough to match arbitrarily nested parens (recursive patterns, for example; look-ahead/behind assertions on their own don’t add that ability), so I could probably support them too, but I’m just not in the mood to self-nerd-snipe right now.
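For what it's worth, nesting to a fixed depth doesn't need any extensions, because the pattern can be built up one level at a time. The following is a sketch of that idea only, not something the user script does; `FLAT`, `ONE_LEVEL`, `TWO_LEVELS`, and the example URL are all hypothetical.

```python
import re

FLAT = r'[^\s()]'                              # one non-paren "stuff" character
ONE_LEVEL = rf'\({FLAT}*\)'                    # "(stuff)"
TWO_LEVELS = rf'\((?:{FLAT}|{ONE_LEVEL})*\)'   # "(stuff(stuff)stuff)"

# The pattern from this post: balanced parens, but no nesting.
flat_re = re.compile(
    rf'(?:ht|f)tps?://{FLAT}+(?:{ONE_LEVEL}{FLAT}*)*', re.IGNORECASE)

# Same shape, with each pair allowed to contain one more level of pairs.
nested_re = re.compile(
    rf'(?:ht|f)tps?://{FLAT}+(?:{TWO_LEVELS}{FLAT}*)*', re.IGNORECASE)

url = 'http://example.com/a(b(c)d)e'
print(flat_re.search(url).group(0))    # http://example.com/a
print(nested_re.search(url).group(0))  # http://example.com/a(b(c)d)e
```

Each extra level makes the pattern longer, and anything nested deeper than the pattern allows gets cut off at the first parenthesis, just as the non-nested version cuts off the URL above.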

A Better Linkifying Regex by Stephan Sokolow is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).
