A Better Linkifying Regex

From time to time, I run across situations where the linkifying Greasemonkey script I use mistakenly includes a closing parenthesis in what it considers to be a URL.

I can’t remember a single situation where I needed to linkify a URL with nested unescaped parentheses, but URLs inside parentheses have bitten me repeatedly, so I decided to solve the problem in a way that’ll work with any regex grammar.

const urlRegex = /\b((ht|f)tps?:\/\/[^\s+\"\<\>()]+(\([^\s+\"\<\>()]*\)[^\s+\"\<\>()]*)*)/ig;

Basically, it matches:

  1. http, https, ftp, or ftps followed by ://
  2. an alternating sequence of “stuff” and balanced pairs of parentheses containing “stuff”

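…where “stuff” refers to a sequence of zero or more non-whitespace, non-parenthesis characters (and, in this linkify.user.js version, non-plus-sign, non-double-quote, and non-angle-bracket too). The one exception is the first run of “stuff”, right after the ://, which must contain at least one character.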
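To make that structure easier to see, here is a rough Python sketch that assembles the same pattern from named pieces. The names STUFF_CHAR, SCHEME, PAREN_PAIR, URL_RE, and find_urls are mine, made up for illustration, and the character class is my transliteration of the JavaScript one above, so treat it as an approximation rather than the script's actual code.

```python
import re

# One "stuff" character: anything except whitespace, '+', '"', '<', '>',
# '(' or ')'. (Transliterated from the JavaScript character class.)
STUFF_CHAR = r'[^\s+"<>()]'

# http, https, ftp, or ftps, followed by ://
SCHEME = r'(?:ht|f)tps?://'

# One balanced, non-nested pair of parentheses around zero or more "stuff".
PAREN_PAIR = r'\(' + STUFF_CHAR + r'*\)'

# Scheme, at least one "stuff" character, then any number of
# (parenthesised pair + optional trailing "stuff") repetitions.
URL_RE = re.compile(
    r'\b(' + SCHEME + STUFF_CHAR + r'+'
    + r'(?:' + PAREN_PAIR + STUFF_CHAR + r'*)*)',
    re.IGNORECASE,
)


def find_urls(text):
    """Return every substring of text that the pattern treats as a URL."""
    return [m.group(1) for m in URL_RE.finditer(text)]


print(find_urls('See (http://example.com/a_(b)) now'))
# ['http://example.com/a_(b)']
print(find_urls('http://example.com/a+b'))
# ['http://example.com/a']
```

The second call shows a side effect of keeping the original blacklist: a plus sign ends the match, just as whitespace would.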

Embarrassingly, aside from two corrections and a few extra characters in the blacklists that I kept from the original linkify.user.js regex, this is a direct translation of something I wrote for http://ssokolow.com/scripts/ years ago… I’d just never remembered the problem in a situation where I could spare the time and willpower to do something about it.

Here’s the corrected Python original.

import re
hyperlinkable_url_re = re.compile(r"""((?:ht|f)tps?://[^\s()]+(?:\([^\s()]*\)[^\s()]*)*)""", re.IGNORECASE | re.UNICODE)

The corrections made were:

  1. Allow the pairs of literal parentheses to be empty
  2. Move a grouping parenthesis so that a(b)c(d)e will be matched as readily as a(b)(c)d.
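To show what those two corrections buy, here is a small sketch comparing the corrected pattern against a guess at what an uncorrected one might have looked like. BEFORE_GUESS is purely hypothetical: the pre-correction regex isn't reproduced in this post, so I wrote one that merely has both flaws (non-empty parentheses required, trailing “stuff” outside the repeated group). The example.com URLs and the helper name first_match are also made up.

```python
import re

# The corrected pattern, as given in the post.
FIXED = re.compile(
    r"((?:ht|f)tps?://[^\s()]+(?:\([^\s()]*\)[^\s()]*)*)",
    re.IGNORECASE | re.UNICODE,
)

# HYPOTHETICAL stand-in for a pre-correction pattern (an assumption, not
# the real earlier code): parentheses must be non-empty, and the trailing
# "stuff" sits outside the repeated group.
BEFORE_GUESS = re.compile(
    r"((?:ht|f)tps?://[^\s()]+(?:\([^\s()]+\))*[^\s()]*)",
    re.IGNORECASE | re.UNICODE,
)


def first_match(pattern, text):
    """Return the first URL the pattern finds in text, or None."""
    m = pattern.search(text)
    return m.group(1) if m else None


CASES = [
    "http://example.com/a(b)(c)d",   # adjacent pairs
    "http://example.com/a(b)c(d)e",  # "stuff" between pairs
    "http://example.com/a()b",       # empty pair
]

for url in CASES:
    print(first_match(BEFORE_GUESS, url), "|", first_match(FIXED, url))
# http://example.com/a(b)(c)d | http://example.com/a(b)(c)d
# http://example.com/a(b)c | http://example.com/a(b)c(d)e
# http://example.com/a | http://example.com/a()b
```

Under those assumptions, only the corrected pattern matches all three URLs in full.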

Markdown source code works especially well to demonstrate the difference.

Naive Linkifying Regex              My Linkifying Regex
[FreeDOS](http://freedos.org).      [FreeDOS](http://freedos.org).
(link includes the closing paren)   (link stops at freedos.org)
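The same comparison can be reproduced in a few lines of Python. This is only a sketch: NAIVE is a stand-in I picked to represent “grab everything up to the next whitespace”, not the actual pattern from any particular linkify script, and the second sample URL is made up.

```python
import re

# ASSUMPTION: a stand-in "naive" pattern that takes everything up to the
# next whitespace character. Not taken from any real linkify script.
NAIVE = re.compile(r"(?:ht|f)tps?://\S+", re.IGNORECASE)

# The corrected Python pattern from this post.
BETTER = re.compile(
    r"(?:ht|f)tps?://[^\s()]+(?:\([^\s()]*\)[^\s()]*)*",
    re.IGNORECASE | re.UNICODE,
)


def matched_urls(pattern, text):
    """Return every substring of text that pattern treats as a URL."""
    return [m.group(0) for m in pattern.finditer(text)]


SAMPLE = "[FreeDOS](http://freedos.org)."
print(matched_urls(NAIVE, SAMPLE))   # ['http://freedos.org).']
print(matched_urls(BETTER, SAMPLE))  # ['http://freedos.org']

# A URL that itself contains a balanced pair, wrapped in parentheses.
WRAPPED = "(see http://example.com/wiki/Foo_(bar))"
print(matched_urls(NAIVE, WRAPPED))   # ['http://example.com/wiki/Foo_(bar))']
print(matched_urls(BETTER, WRAPPED))  # ['http://example.com/wiki/Foo_(bar)']
```

The second sample is the case that a blanket “never allow parentheses” rule would get wrong, which is why the pattern allows balanced pairs instead.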

Theoretically, extensions such as the recursive patterns that some regex engines support go far enough beyond classic regexp syntax to match nested structures, so I could probably also support nested parens, but I’m just not in the mood to self-nerd-snipe right now.

CC BY-SA 4.0 A Better Linkifying Regex by Stephan Sokolow is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
