A Compromise Between Substring and Prefix Matching

A.K.A.: How to write what human intuition actually expects substring matching to be

While the changes aren’t yet ready to be pushed, I’ve been working on one of my hobby projects quite a bit over the last few days and I just thought I’d share a little something I stumbled upon while implementing a result filter box.

Systems with advanced string searching will often let you choose between prefix or substring matching, but I’ve found that both have glaring flaws when you’re implementing something like a “find as you type” launcher, where the goal is a fast match that’s “good enough”.

With substring matching, you quickly realize that computers are much better than humans at finding substrings in the darndest of places, making substring matching very counter-intuitive. (I get the impression that it has to do with humans thinking in syllables while computers don’t, so it’d be interesting to see how the effect changes in writing systems built around word or syllable units, like Kanji or Hangul.)

By contrast, prefix matching is often overly specific and ill-suited to situations where many titles may begin with the same article (A, The, etc.) or the name of a series with many entries. Unfortunately, splitting off the articles, then moving them to the end, as Steam does, also has the potential to trip people up, so there’s no perfect solution.

The solution I developed, almost by accident, is essentially a half-way point between prefix matching and the full-blown keyword-based approach a search engine takes:

Use case-insensitive matching and require that substring matches begin at a word boundary.

This has the following desirable characteristics for a find-as-you-type solution:

  • It minimizes the need to press modifier keys, which require costly muscle synchronization:
    • It’s case-insensitive
    • There’s no need for users to quote literals to avoid them being reordered as would be necessary with a full-blown keyword search grammar (ie. “pirates of” won’t match “of pirates”)
  • It’s robust against variations in title formatting:
    • A search for “bri” will match both “The Bridge” and “Bridge, The” without also returning spurious results like “Abrix the robot”.
    • A search for “pir” will return “Space Quest III: The Pirates of Pestulon” without concern for how many Space Quest games sort earlier in the results, whether the title was transcribed using “3”, “III”, or “]|[”, whether the subtitle begins with “The”, or whether the separator is “: ” or “ – ”.
  • It lacks the over-broadness that you find with substring matching, where “pir” will match “Drascula: The Vampire Strikes Back” and “Spirits”.

It’s also simple to implement:

  • For typical regexp searching, just prepend \b to the pattern and set the case-insensitive flag. (If your engine lacks \b, then use (^|\s) instead.)
  • For literal string matching on top of a regexp engine, just escape the pattern and follow my instructions for a regexp search.
  • For CMD.EXE-style wildcard matching, escape the pattern, then replace \? with . and \* with .* before prepending the \b.
  • For a manual implementation of literal-string matching on titles with normalized whitespace, just check whether it matches at the beginning (eg. title.lower().startswith(pattern.lower())) and then prepend a space and search within (eg. title.lower().find(' ' + pattern.lower()) >= 0). (find returns -1 on a miss, where index would raise ValueError.)
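As a rough illustration of the last two bullets, here is a minimal Python sketch. The function names matches_literal and wildcard_to_regex are made up for this example, and the manual version assumes titles with normalized whitespace, as described above.

```python
import re


def matches_literal(pattern, title):
    """Manual check: the pattern must start the title or directly follow a space."""
    p = pattern.lower()
    t = title.lower()
    return t.startswith(p) or (' ' + p) in t


def wildcard_to_regex(pattern):
    """Translate a CMD.EXE-style wildcard pattern into a compiled regex.

    Everything is escaped first, then the escaped wildcards are turned back
    into their regex equivalents and the match is anchored at a word boundary.
    """
    escaped = re.escape(pattern)
    translated = escaped.replace(r'\?', '.').replace(r'\*', '.*')
    return re.compile(r'\b' + translated, re.IGNORECASE)
```

Something like wildcard_to_regex("p?r*").search(title) would then be truthy for titles where a word starts with “p”, any character, then “r”.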

UPDATE 2016-10-02: The \b word boundary token doesn’t consider parentheses to be part of a word, which I’ve found to be a confusing surprise in day-to-day use, so you’ll want to use (^|\b|\s) instead of \b. This will allow both “(Eng” and “Eng” to match “(English)” in typical usage for maximum intuitiveness.
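For anyone who wants to try this, here is a small Python sketch of the regexp approach with the updated prefix. The names compile_filter and filter_titles are hypothetical, and the query is treated as a literal string rather than a user-supplied regexp.

```python
import re


def compile_filter(query):
    """Compile a case-insensitive, boundary-anchored matcher for a literal query."""
    # (?:^|\b|\s) is the non-capturing form of the (^|\b|\s) prefix.
    return re.compile(r'(?:^|\b|\s)' + re.escape(query), re.IGNORECASE)


def filter_titles(query, titles):
    """Return the titles that contain the query at the start of a word."""
    matcher = compile_filter(query)
    return [title for title in titles if matcher.search(title)]
```

Calling filter_titles("bri", ["The Bridge", "Bridge, The", "Abrix the robot"]) should keep the first two titles and drop the third.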

In case you want to play around with this, here’s a quick sampling of how to regex-escape a string in various popular environments:

Posted in Geek Stuff | Leave a comment

Using OpenCV to automatically skip recurring post-roll ads

TL;DR: Install OpenCV-Python, download this script and follow the instructions in the script’s --help output.

While I like The Young Turks, they’ve recently started adding the same two or three carnival barker-esque appeals for subscribers to the end of all of their videos. That gets very annoying very quickly.

Since I don’t believe in rewarding bad behaviour (like forcing avid viewers to see the same couple of annoying ads a million times), I refuse to let them nag me into being a member. However, I still need something to occupy my mind while doing boring tasks, so I needed a solution.

As such, here’s a Python OpenCV script which will find the time offset of the last occurrence of a frame (eg. a screenshot of the TYT title card that appears between the content and the ad) in a video file and then write an MPV EDL (Edit Decision List) file which will play only the portion of the video prior to the matched frame.
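To give a feel for the EDL half of that, here is a minimal Python sketch. It is not the script itself: build_edl and write_edl are made-up names, the cut offset is assumed to come from the frame-matching step, and filenames that would need escaping in the EDL syntax are simply rejected.

```python
def build_edl(video_name, cut_seconds):
    """Return mpv EDL text that plays video_name from the start up to cut_seconds."""
    if ',' in video_name or '\n' in video_name:
        # Assumption: keep the sketch simple by refusing names that would
        # need escaping in the EDL syntax.
        raise ValueError("filename needs EDL escaping: %r" % video_name)
    if cut_seconds <= 0:
        raise ValueError("cut point must be after the start of the video")
    # Header line, then one segment: source file, start offset, length.
    return "# mpv EDL v0\n%s,0,%.3f\n" % (video_name, cut_seconds)


def write_edl(edl_path, video_name, cut_seconds):
    """Write the EDL next to wherever the caller wants it."""
    with open(edl_path, "w", encoding="utf-8") as handle:
        handle.write(build_edl(video_name, cut_seconds))
```

Opening the resulting .edl file in MPV should then play only the first cut_seconds of the named video.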

UPDATE: Hint: Put this script, your videos, and one or more screenshots (to be matched in a fallback chain, sorted by name) into the same folder and you can just double-click it.

I’ve also done the preliminary research to fuzzy-match the audio of those two or three naggy bits in case they decide to try to render this ineffective by moving the title card to the very end… partly because it would also give a more accurate cut point if used with the current clips.

(As is, I tend to lose the last 500 to 1500 milliseconds of actual content due to variations in how they cut the pieces of each clip together… but, even if I lost an entire clip every now and then, it’d be an acceptable sacrifice to avoid those annoying nags. Current clips are cut together such that stopping at the last frame of the end-title card removes the nag perfectly.)

Posted in Geek Stuff | Leave a comment

“Gypsy Bard” and the My Little Pony Fandom’s Creative Output

As I’ve mentioned before, I’ve a certain fondness for throwing characters into interesting situations to see what makes them tick and, as I’ve also mentioned before, I got lured into the My Little Pony fandom by the wide selection of catchy fan-created music. I’ve decided I want to comment further on that.

While waiting for some files to transfer, I found myself reading a well-written self-insert fanfic (Damn you, recommendations sidebar! You always know just what’ll hook me next! ;P ) with an interesting plot point:

When our hero discovers that this isn’t just any My Little Pony setting, but, rather, that she’s “stuck in a snow globe” made of her own hypothetical musings, she tries to drink away the resulting burst of existential despair and winds up singing a song. There’s more to it, but I want to focus on the song:

Gypsy Bard Cover by Dreamchan feat. Princewhateverer

Ignoring the benefits of this kind of cross-referencing being common, let’s look at the context in which this song exists.

In addition to Dreamchan’s cover, this same song also has a remix by The Living Tombstone (with impressive visual accompaniment by olibacon), a cover with piano accompaniment by Flutterwhat, an orchestral-styled instrumental version by BassBeastDJ, and various other ones that don’t stand out from the crowd as clearly, like this 8-bit cover.

Now, this isn’t unheard of. In fact, having a large constellation of covers, remixes, and the like, seems to be becoming more common. For example, see this cover and many others for the song Megalovania from Toby Fox’s Undertale or the many “Abridged Series” comedy dub edits. (Here’s an example clip with non-worksafe language taken from one of them.)

It’s not even unusual for a fandom to produce fanworks that go beyond ordinary covering and/or remixing. For example, here are some songs from the fans of Starcraft, Deus Ex, Portal ([1], [2]), Star Wars Galaxies, Mass Effect 2, and Battlefield 1.

What makes MLP:FiM noteworthy is how much the fandom has been producing fanworks that go beyond merely remixing existing content. For example, the Deus Ex and Mass Effect 2 songs I linked were both by the same artist, Miracle of Sound, while the MLP fandom alone has artists like ponyphonic, WoodenToaster, and StormWolf, each having produced multiple songs.

The fans were even working on a fighting game, called Fighting is Magic, before Hasbro decided that went too far and sent them a Cease and Desist notice.

What makes Gypsy Bard so special is that it’s an original song written for episode 7 of an “Abridged Series” (I quote it because the creators consider it too divergent for that term to fit well) called “Friendship is Witchcraft”.

Why is that so important? Because of what it implies about the size and vitality of the fan community. Not only did they get together enough people to voice-act an Abridged Series, they also managed to find someone to write original songs. Then, on top of that, one of those songs inspired its own covers and remixes… one of which (Dreamchan’s) would have been based on Flutterwhat’s piano version if Princewhateverer hadn’t contributed a guitar version. In other words, it’s a third-order fan-work at minimum. (A fan-work of a fan-work of a fan-work… not counting indirect inspirations) Those are not common. In fact, until My Little Pony, the most I’d ever seen was a second-order fan-work.

However, to be fair, with music, the main barrier to high-order fanworks is getting attention. What I originally saw were second-order fanfics, which require readers to be familiar with the original work first. (For example, a Harry Potter fanfic named Make a Wish was popular enough for other people to write fanfics in the universe it established.)

A closer musical analogue would be The Mirror Lies, a heavy metal song written as a tie-in to a My Little Pony fanfic named A Change of Face.

Nonetheless, I still maintain that the fandom for My Little Pony: Friendship is Magic has attracted an uncommon mix of quality and quantity when it comes to creative and skilled fans. Let’s just hope that it’s merely ahead of the curve and this kind of output will become increasingly common.

UPDATE 2021-01-12: I haven’t kept up on things enough to be sure, but it looks like it’s probably a bit of both. I haven’t noticed anyone else in fanfiction as prolific as the bronies were at their peak, but we are also seeing a rise in content overall as well.

An example of a comparable outlier on the gaming side would probably be Minecraft with songs like Diggy Diggy Hole and MoonQuest for Yogscast (the former now covered by Wind Rose), In Search of Diamonds by Eric Fullerton, parodies like TNT and Revenge by CaptainSparklez under his own name, etc.

Posted in Web Wandering & Opinion | Leave a comment

Chickasaw Mountain by Leslie Fish

I’d like to take a few minutes to wax poetic on a song I just recently discovered:

Chickasaw Mountain by Leslie Fish (lyrics, buy)

There are so many reasons I love the song. However, I’ll focus on the lyrics since I’m not very good at explaining why I love music aside from contributing factors like “it sounds celtic” and “it incorporates violins”.

I’ll start by focusing on the most obvious layer of the lyrics:

It’s a folk ballad, where the singer tells of the Faustian bargain a friend made with a being known as the Lady of the Morning Star. I love the impression this layer of meaning gives.

At first, it makes various references to deals with The Devil, with lines like “Call Her Lady of the Morning Star” and making it his sister’s apple tree that he winds up hanging on.

…but, at the same time, it makes it very clear that this is NOT Lucifer, with the lines “Name your goal; She won’t ask your soul” and this passage:

Seek no level of God or Devil
She’s something older by far
Call Her Lady of the Morning Star

The overall impression is that The Lady of the Morning Star is some kind of primordial Fae-like being: An immortal with incomprehensible motivations, who views mortals as toys for her amusement and is so ancient that, if Lucifer does exist in this setting, he was likely named in reference to her.

The Fae feeling is further reinforced by the last line, “Any wild place on Earth will do!”

(It’s a common theme that the Fae and other such creatures shy away from civilization and dwell in “the wild and untamed places of the world”.)

On a purely emotional level, this layer of meaning is all I care about and I can’t get over how much I love it.

However, on an intellectual level, there’s still more to come. Let’s move on to the second layer of meaning:

According to the subtitle, the song is a tribute to Phil Ochs.

Seen in this light, a second complete set of references emerges.

For example, these lines:

She offers two bargains; the price is deep and dark
One takes your life and the other leaves a mark

…and these lines…

Whoever has wisdom can guess what lies unsaid
The cost of the gift to the living and the dead
Still if you feel you’ll gain from the deal
You’ll play with the old Morning Star

If that’s not a metaphor for the “live fast and die or burn out young” pattern that takes so many great artists, I don’t know what is.

The rest of the song follows Mr. Ochs’s rise and fall closely, with phrases like these:

Made him the best of his generation
Sang till the end of the war
And not a moment more.

…which reference his status as one of the biggest names among Vietnam War protest singers and his subsequent descent into mental illness immediately thereafter… finally ending with the lyric “Hanging on his sister’s apple tree”, a reference to how, less than a month before the one-year anniversary of the end of the Vietnam War, Phil Ochs committed suicide by hanging while living with his sister.

Posted in Web Wandering & Opinion | 2 Comments

Working around Pidgin’s mis-designed certificate error dialog

The Pidgin developers apparently haven’t thought things through very well when it comes to TLS/SSL support because, if you want to connect to a network which uses a self-signed cert, they’ll present you with a permission dialog every time you connect (no “remember” option) and, last time I reported this, they considered it “WONTFIX: self-signed certs are a bug”.

They seem to think that it will force network operators to get proper certificates but, in reality, they don’t have that kind of leverage when every other IRC client allows you to ignore cert errors, so it just forces people to either turn off SSL or desensitizes them to the warning dialogs.

This is especially problematic for me because one of the networks I connect to is encrypted-only, managed by someone who doesn’t trust Let’s Encrypt and, if the Pidgin SSL handshake times out, it remembers the failure rather than re-displaying the prompt on reconnect. Because I have no idea how long the timeout is but it always seems too short, that trained me to punch “allow” on the annoying dialog as quickly as possible without wasting time reading the prompt… never a good sign.

So, today, I’m going to teach you how to get the best of both worlds: How to use stunnel to trick Pidgin into using self-signed SSL without complaining. (And, as a bonus, stunnel makes it easy to verify a self-signed cert without adding it to the system-wide cert store, so it can actually be more secure than the certs the Pidgin developers want you to use.)

First, install stunnel and set it to run on startup. On Debian-based distros like Ubuntu and Mint, this is as simple as running sudo apt-get install stunnel and then setting ENABLED=1 in /etc/default/stunnel4.

Next, we need to write a config file so that, when your client connects to a stunnel server on localhost without encryption, it will make an encrypted connection to the IRC server in question.

The key lines are as follows:

; Maximize security
; NOTE: See the manpage or sample config file for implications
chroot = /var/lib/stunnel4/
setuid = stunnel4
setgid = stunnel4

; Needed for stunnel to work properly
; (Prefix the contents of the chroot line if not using chroot)
pid = /stunnel4.pid

; Disable support for insecure SSLv2 protocol
options = NO_SSLv2

; Define the actual proxy service
[irc-your-network]
client = yes
accept = 127.0.0.1:6612
connect = irc.your-network.com:6697

Now, following these steps will get you to parity with what you got from putting up with Pidgin’s security dialog:

  1. Put those lines into /etc/stunnel/whatever_you_want.conf
  2. Start stunnel (sudo /etc/init.d/stunnel4 start if you’re on Ubuntu 14.04 LTS or older)
  3. Set Pidgin to connect to the address in the accept line.

Pidgin will think it’s connecting to an un-encrypted IRC server and the connection will be encrypted between stunnel and the server.

However, we can do one better. If we can get the server certificate in PEM format, we can have stunnel verify it, preventing man-in-the-middle attacks.

The ideal solution would be to download the PEM file through a trusted channel but, as a stop-gap, let’s replicate the “trust whatever we see first” behaviour that SSH uses. Fill in your IRC server’s details in the following command and run it to dump the server cert:

openssl s_client -showcerts -connect irc.your-network.com:6697 </dev/null 2>/dev/null|openssl x509 -outform PEM >your-irc-network.pem
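The same “trust whatever we see first” idea can be sketched in Python using the standard library’s ssl.get_server_certificate, which fetches the presented certificate as PEM text without verifying it. This is only an illustration: pin_certificate and its return values are made up for this example, and the fetch parameter exists purely so the logic can be exercised without a network connection.

```python
import ssl
from pathlib import Path


def pin_certificate(host, port, pem_path, fetch=ssl.get_server_certificate):
    """Save the server cert on first contact; afterwards, compare against it.

    Returns "pinned" when the file was just created, "match" when the server
    still presents the saved certificate, and "mismatch" otherwise.
    """
    presented = fetch((host, port))  # PEM text, fetched without verification
    path = Path(pem_path)
    if not path.exists():
        path.write_text(presented)
        return "pinned"
    saved = path.read_text()
    return "match" if saved.strip() == presented.strip() else "mismatch"
```

A “mismatch” result on a later run would be the cue to find out, through some other channel, whether the operator actually rotated the certificate.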

Now, copy the resulting your-irc-network.pem file into /etc/stunnel/ and add the following lines to your whatever_you_want.conf file:

verify = 3
CAfile = /etc/stunnel/your-irc-network.pem

You’ll probably also want to add these lines temporarily so you can see what’s going wrong if the verification fails:

debug = 7
output = /stunnel.log

…and, that’s it. Just restart stunnel, reconnect with your IRC client, and, barring verification errors, you should have an encrypted connection which verifies the self-signed certificate.

(I say “should” because, as of this writing, the self-signed cert I’m testing against is expired, so I can’t get all the way through the verification process to confirm.)

Posted in Geek Stuff | Leave a comment

Fixing “Steam refuses to start (without error message) under VirtualBox”

TL;DR:

  1. Disable 3D acceleration for the guest
  2. export CPU_MHZ=2000
  3. find ~/.steam/steam/ubuntu12_32/steam-runtime/ -iname '*libstdc++.so*' -execdir mv {} {}.bak \;

Since GOG has introduced GOG Connect, I decided I might as well try to turn some worse-than-worthless unredeemed Steam keys into actual GOG-owned games.

Now, given that I don’t trust the Steam client, that means quarantining it inside a VirtualBox VM that can see the public Internet, but not my LAN.

Unfortunately, whoever wrote it most definitely doesn’t code defensively (It’s like driving defensively. Assume Murphy’s Law and code accordingly.), because the /usr/bin/steam wrapper ignores common options like --help and -v and, in my VM, it exits without showing an error message and without actually starting up.

In order to track down what was going wrong, I had to rely on my skills as a developer:

  1. Run file "`which steam`" to verify that I’m dealing with a wrapper script
  2. Run bash -x "`which steam`" to get a readout of what’s actually being executed.
  3. Run bash -x /home/user/.local/share/Steam/steam.sh to see what’s happening in the actual wrapper.
  4. Finally see the “Unable to determine CPU Frequency. Try defining CPU_MHZ.” message.
  5. Run export CPU_MHZ=2000; steam and then wait for it to download its updates.
  6. Wait several minutes after it stops producing output while it appears to do nothing, then Ctrl+C out of it.
  7. Try previous steps again and discover an error message getting covered up by the outer wrapper.
  8. Search up results for said error and discover that none of the suggested commands reference paths used in this version of Steam.
  9. Manually craft a new version of them: find .steam/steam/ubuntu12_32/steam-runtime/ -iname '*libstdc++.so*' -execdir mv {} {}.bak \;
  10. Finally get a helpful error message, advising me to turn off VirtualBox’s 3D acceleration to make Steam work.
  11. Restart the VM with 3D acceleration disabled and try re-setting CPU_MHZ and re-running steam.
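
The rename from step 9 can also be sketched in Python for anyone who would rather not trust a find one-liner. The function name sideline_bundled_libstdcxx is made up, and unlike the find command this version skips entries already ending in .bak so it can be re-run safely.

```python
from pathlib import Path


def sideline_bundled_libstdcxx(runtime_dir):
    """Rename every *libstdc++.so* entry under runtime_dir to <name>.bak."""
    root = Path(runtime_dir).expanduser()
    # Collect first so that renaming does not disturb the directory walk.
    targets = [
        path for path in root.rglob('*')
        if 'libstdc++.so' in path.name.lower()   # same idea as -iname '*libstdc++.so*'
        and not path.name.endswith('.bak')       # deviation from find: skip earlier runs
        and (path.is_symlink() or path.is_file())
    ]
    renamed = []
    for path in targets:
        backup = path.with_name(path.name + '.bak')
        path.rename(backup)
        renamed.append(backup)
    return renamed
```

It would be pointed at the same steam-runtime directory used in the find command above.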

Posted in Geek Stuff | Leave a comment

Quick Hack: Global Hotkeys for SMPlayer

I have a family member who has grown used to pausing Audacious Media Player with a quick press on their XF86AudioPlay key, but half of what they listen to is video files, so I wrote a quick shell script to work around SMPlayer’s lack of global hotkey support.

Instructions:

  1. Install xdotool (on Debian/Ubuntu/Mint)
  2. Put the script somewhere and set it executable
  3. Use your desktop’s global hotkey support or a utility like xbindkeys to run the script whenever something like Ctrl+XF86AudioPlay is pressed.

If you want it to do something other than play/pause, the simplest solution is probably to replace space with "$@"  (including quotes) inside the script, save it under a name like smplayer_remote.sh, and then set up associations like these:

  • XF86AudioPlay → smplayer_remote.sh space
  • XF86AudioRewind → smplayer_remote.sh Left
  • XF86AudioForward → smplayer_remote.sh Right

(You can find the keysym names using the xev tool (on Debian/Ubuntu/Mint) and man xdotool will tell you how to specify modifier keys like Ctrl and Alt)

I initially tried to make it behave exactly like a real global hotkey would, but xdotool seems to have an unavoidable race condition and no way to force a 100ms delay after using the windowfocus command… so, instead, this will momentarily switch window focus, then restore whatever you were working on before.
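For the curious, here is a rough Python sketch of that focus-switching approach. It is not the shell script mentioned above: the function names are made up, picking the first visible window with the class “smplayer” is an assumption, and the short sleep between steps is only a crude guard that may or may not be enough to dodge the race.

```python
import subprocess
import time


def plan_keypress(previous_window, player_window, key="space"):
    """Return the xdotool argument lists: focus player, press key, restore focus."""
    return [
        ["windowfocus", "--sync", player_window],
        ["key", key],
        ["windowfocus", previous_window],
    ]


def xdotool_lines(*args):
    """Run xdotool and return its output split into tokens ([] on failure)."""
    result = subprocess.run(["xdotool", *args], capture_output=True, text=True)
    return result.stdout.split() if result.returncode == 0 else []


def send_key_to_smplayer(key="space", settle_seconds=0.1):
    """Briefly focus SMPlayer, send one key, then restore the previous focus."""
    focused = xdotool_lines("getwindowfocus")
    # Assumption: the first visible window with this class is the player.
    players = xdotool_lines("search", "--onlyvisible", "--class", "smplayer")
    if not focused or not players:
        return False
    for step in plan_keypress(focused[0], players[0], key):
        subprocess.run(["xdotool", *step], check=True)
        time.sleep(settle_seconds)  # crude guard against the focus race
    return True
```

Binding a hotkey to something that calls send_key_to_smplayer("space") would then act as a play/pause toggle, with “Left” and “Right” covering the seek keys.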

It should be possible to do it properly if I were willing to either patch xdotool or drop down and write against the bare X11 APIs, but I just don’t have the time for either and this is Good Enough™ for the intended use case.

Posted in Geek Stuff | 3 Comments