From time to time, I run across situations where the linkifying Greasemonkey script I use mistakenly includes a closing parenthesis in what it considers to be a URL.
Given that I can’t remember a single situation where I needed to linkify a URL with nested unescaped parentheses but URLs inside parentheses have bitten me repeatedly, I decided to solve the problem in a way that’ll work with any regex grammar.
const urlRegex = /\b((ht|f)tps?:\/\/[^\s+\"\<\>()]+(\([^\s+\"\<\>()]*\)[^\s+\"\<\>()]*)*)/ig;
Basically, it matches:
- a URL scheme of http://, https://, ftp://, or ftps://
- an alternating sequence of “stuff” and balanced pairs of parentheses containing “stuff”
…where “stuff” refers to a sequence of zero or more non-whitespace, non-parenthesis characters (and, in this linkify.user.js version, non-angle-bracket and non-double-quote characters too).
Embarrassingly, aside from two corrections and a few extra characters in the blacklists that I kept from the original linkify.user.js regex, this is a direct translation of something I wrote for http://ssokolow.com/scripts/ years ago… I’d just never remembered the problem in a situation where I could spare the time and willpower to do something about it.
Here’s the corrected Python original.
hyperlinkable_url_re = re.compile(r"""((?:ht|f)tps?://[^\s()]+(?:\([^\s()]*\)[^\s()]*)*)""", re.IGNORECASE | re.UNICODE)
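As a quick illustration of how the Python pattern behaves, here is a minimal sketch. The pattern is copied from above; the sample sentence and its URLs are made up for the example.

```python
import re

# Pattern copied verbatim from the post.
hyperlinkable_url_re = re.compile(
    r"""((?:ht|f)tps?://[^\s()]+(?:\([^\s()]*\)[^\s()]*)*)""",
    re.IGNORECASE | re.UNICODE)

# Made-up sample text: two URLs, each wrapped in prose parentheses.
text = ("See the article (http://en.wikipedia.org/wiki/Set_(mathematics)) "
        "or the mirror (ftp://example.com/pub/).")

# findall() returns the contents of the single capturing group per match.
urls = hyperlinkable_url_re.findall(text)
print(urls)
# ['http://en.wikipedia.org/wiki/Set_(mathematics)', 'ftp://example.com/pub/']
```

The balanced pair inside the first URL is kept, while the closing parenthesis that belongs to the surrounding sentence is left out of both matches.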
The corrections made were:
- Allow the pairs of literal parentheses to be empty
- Move a grouping parenthesis so that a URL shaped like a(b)c(d)e will be matched in full
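The effect of both corrections can be checked against the corrected pattern with a small sketch like the one below. The sample URLs are invented, and the uncorrected pattern is not reproduced because the post does not show it.

```python
import re

# Corrected pattern, copied verbatim from the post.
hyperlinkable_url_re = re.compile(
    r"""((?:ht|f)tps?://[^\s()]+(?:\([^\s()]*\)[^\s()]*)*)""",
    re.IGNORECASE | re.UNICODE)

# Invented sample URLs exercising each correction.
samples = [
    "http://example.com/call()",     # empty pair of literal parentheses
    "http://example.com/a(b)c(d)e",  # two pairs separated by other text
    "http://example.com/a(b)(c)d",   # two adjacent pairs
]

# fullmatch() succeeds only if the pattern consumes the entire string.
results = [hyperlinkable_url_re.fullmatch(s) is not None for s in samples]
print(results)  # [True, True, True]
```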
Markdown source code works especially well to demonstrate the difference.
| Naive Linkifying Regex | My Linkifying Regex |
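A rough way to reproduce the comparison in code is sketched below. The "naive" pattern here is an assumption (scheme followed by any run of non-whitespace), not necessarily the one the original linkify.user.js used, and the Markdown snippet is made up.

```python
import re

# Assumed stand-in for a naive linkifier: everything up to the next whitespace.
naive_url_re = re.compile(r"(?:ht|f)tps?://\S+", re.IGNORECASE)

# Python pattern copied verbatim from the post.
hyperlinkable_url_re = re.compile(
    r"""((?:ht|f)tps?://[^\s()]+(?:\([^\s()]*\)[^\s()]*)*)""",
    re.IGNORECASE | re.UNICODE)

# Made-up Markdown source: the URL sits inside the link's own parentheses.
markdown = "Read [the docs](http://example.com/docs) first."

naive_match = naive_url_re.search(markdown).group(0)
balanced_match = hyperlinkable_url_re.search(markdown).group(0)

print(naive_match)     # http://example.com/docs)
print(balanced_match)  # http://example.com/docs
```

The naive pattern swallows the closing parenthesis of the Markdown link syntax, while the balanced-parentheses pattern stops just before it.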
Theoretically, look-ahead/behind assertions are enough of an extension to regexp syntax to allow real HTML parsing, so I could probably also support nested parens, but I’m just not in the mood to self-nerd-snipe right now.
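For comparison, nested parentheses can be handled outside the regex entirely. The sketch below is not the approach described above: it swaps the pure-regex solution for a loose regex plus a depth counter, and `candidate_re`, `trim_unbalanced`, and `find_urls` are hypothetical names invented for this example.

```python
import re

# Hypothetical loose pattern: grab everything up to the next whitespace.
candidate_re = re.compile(r"(?:ht|f)tps?://\S+", re.IGNORECASE)

def trim_unbalanced(url: str) -> str:
    """Cut the URL at the first ")" that has no matching "(" before it."""
    depth = 0
    for i, ch in enumerate(url):
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                return url[:i]  # stray closer: the URL ends just before it
            depth -= 1
    return url

def find_urls(text: str) -> list[str]:
    """Find candidate URLs, then trim unbalanced closing parentheses."""
    return [trim_unbalanced(m.group(0)) for m in candidate_re.finditer(text)]

found = find_urls("(see http://example.com/a(b(c))d) for details")
print(found)  # ['http://example.com/a(b(c))d']
```

This sketch makes no attempt to strip trailing punctuation or to exclude quotes and angle brackets, so it would need more work before replacing the regex in a real linkifier.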