A more formal way to think about validity of input data

I’ve begun to port one of my hobby projects from Python to Rust and, while setting up the clap argument parser, I found myself having to bind to the access(2) libc function myself.

Yes, it exposes you to a race condition exploit if you’re not careful, because the permissions could change between checking and depending on them. Yes, it’s a documented fact that it may be more permissive than actually attempting to access the filesystem. (I believe the situation I’m remembering was “access() doesn’t consider ACLs when evaluating permissions”) …but how else am I to implement a “fail early” check for “Can I create files in this directory?” when there exist real in-the-wild examples of filesystems (eg. AFS) having been configured to allow the creation of a hypothetical test file, but not the subsequent deletion?

That said, despite my intent to use Rust to ensure I handle every recoverable error case, there’s still a certain appeal to being able to point to a spot and say “beyond this point, this piece of data is trustworthy”.

Thinking about this made me realize a nice, simple way to think about handling input data: by analogy to passing by value (with deep copying) or by reference.

NOTE: While my examples will all use command-line arguments, this applies to any kind of input data.

Value Arguments

If a command-line argument cannot become invalid after being validated, then it’s a value argument. Examples of this include:

  • Boolean flags like “mirror this print job”
  • Integers representing things like the number of copies of a document to print
  • Strings which can’t experience any kind of namespace collision

You can validate value arguments once and then trust that they’ll stay valid.

Reference Arguments

If an argument depends on something outside your control to determine its validity, then a validity check only applies to the instant you perform it. Common examples of “reference arguments” include:

  • Filesystem paths (Between the check and use, permissions could change, a creation/deletion/rename could invalidate the path, etc.)
  • File descriptors (Even a supposedly local file descriptor could be on a network-mounted drive which goes away)
  • Strings used to create filenames (someone could create a file with that name which you lack the permissions to manipulate)
  • Network addresses
  • Cached results of arbitrary checks

This means that you need to be prepared for the unexpected every time you use a reference argument and you can only check separately from using them if the following conditions are met:

  1. The check has no security implications and can be safely removed
  2. You accept that the check could fail but the attempt could still succeed
  3. You accept that the check could succeed but the attempt could still fail
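A minimal Python sketch of those three conditions, assuming hypothetical helper names (`warn_if_unwritable` and `save_report` are illustrative, not taken from any real project). The `os.access` check is purely advisory, and the real error handling lives at the point of use:

```python
import os
import sys


def warn_if_unwritable(directory):
    """Advisory fail-early check (hypothetical helper).

    Returns False if the directory does not look writable right now.
    The answer may be stale by the time we write, so nothing may depend on it.
    """
    looks_ok = os.path.isdir(directory) and os.access(directory, os.W_OK | os.X_OK)
    if not looks_ok:
        print(f"warning: {directory!r} does not look writable", file=sys.stderr)
    return looks_ok


def save_report(directory, name, data):
    """The real validation: attempt the write and handle failure here."""
    path = os.path.join(directory, name)
    try:
        with open(path, "wb") as handle:
            handle.write(data)
    except OSError as err:
        # Reached whether or not the advisory check passed earlier.
        print(f"error: could not write {path!r}: {err}", file=sys.stderr)
        return None
    return path
```

Deleting every call to `warn_if_unwritable` changes nothing except how early the user hears about an obvious mistake, which is the property condition 1 asks for.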


Argument (Type, Value or Reference): Why?

  • (Boolean, Value): Nothing external to the program will invalidate this. (The only way this could be a reference is if there were some kind of wrapper which detected the orientation of pre-punched cardstock in the printer and then did or didn’t pass this flag. The user could invalidate it by flipping/rotating the card stock before the print job actually begins.)
  • (Boolean, Reference): The flag implies that either the user or the code detected a rewritable CD/DVD, but the user could swap in a non-rewritable disc before it actually gets used if the script does something long-running first, like generating an ISO in /tmp. Because you can only erase a rewritable disc, this must be validated as late as possible. (ie. After the drive tray has been locked and right before the operation would take place)
  • Number of copies to print (Integer, Value): The only relevant detail which can change is how much paper is in the printer, and, if there isn’t enough, the proper solution isn’t to reduce the size of the print job.
  • File descriptor (Integer, Reference): The descriptor could be pointing at a resource on a network-attached device that goes away.
  • Document Title (String, Either): Whether to treat this as a reference depends on where it will end up and how you handle failure. If you’re converting an eBook with ebook-convert from Calibre, then it’s a value because the output filename is specified separately and whether your title will override the source file’s metadata is not up for debate.
  • Output Filename (String, Reference): No matter how many times you validate, it’s possible that a read-only file will have taken that name by the time you call open()
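For the output filename case, one hedged way to cope is to let the open() call itself be the validation. The sketch below uses Python’s exclusive-create mode; the function name `create_output_file` is a hypothetical illustration:

```python
def create_output_file(path):
    """Create `path` for writing, refusing to touch anything already there.

    Mode "x" asks the OS to create the file and fail if the name is taken,
    as a single step, so there is no gap between checking and using.
    Returns an open binary file object, or None if the name is already in use.
    Other failures (permissions, missing directory) propagate as OSError.
    """
    try:
        return open(path, "xb")
    except FileExistsError:
        return None
```

A caller that gets None back can pick another name or report the collision, instead of discovering it halfway through a long-running job.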

The Takeaway

  • Think in terms of how one piece of data depends on another and don’t forget that dependencies can extend outside of your program.
  • Whether a piece of data can be validated once and then trusted is unrelated to its data type or how it’s passed within your code. (You can pass a filename or URL by value but it’s still a reference to an external resource. A network filesystem will subvert your expectations for how reliable it is to hold an open file descriptor. etc. etc. etc.)
  • The definition of “valid” for a piece of data may depend on how your program is intended to be used. (A human might specify a filename and re-run your tool if it’s already taken. From your perspective, that means it’s valid even if it causes the process to abort. A GUI frontend, on the other hand, probably won’t know how to detect that kind of failure and retry. Expose a more foolproof API by using something like mkstemp or mkdtemp and then returning the newly-created path.)
  • Functions like access which check the validity of a reference are unreliable and should only be used to catch obvious mistakes early so the user doesn’t have to waste their time waiting for a failure that could have been anticipated. If it’s unsafe to comment them out, you’re doing it wrong.
    (eg. You can use access to detect read-only target directories before you know the exact output filename… with the caveat that they could be made read-only between the check and the attempt to actually write the file.)
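The mkstemp suggestion above can be sketched in Python as follows. This is only an illustration under assumed names: `create_unique_output` and its default prefix and suffix are hypothetical.

```python
import os
import tempfile


def create_unique_output(directory, prefix="output-", suffix=".dat"):
    """Create a brand-new file in `directory` and report which name was used.

    tempfile.mkstemp picks an unused name and creates the file in one step,
    so the caller never has to guess a free filename and retry.
    Returns (open binary file object, path).
    """
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=suffix, dir=directory)
    return os.fdopen(fd, "wb"), path
```

A frontend then only needs to read the returned path rather than detect and recover from a name collision.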
Posted in Geek Stuff

On Making Steam Machines Successful

TL;DR: Provide a summarized representation of system requirements, make it easier to decide between different models, partner with YouTube and/or Netflix to make the device more valuable, spin the cost of a Steam Machine as an investment in cheaper per-game costs and long-term compatibility, and appoint/hire a hype management expert.

With Steam Machines, Valve is quite possibly the first company to have a viable idea for a non-traditional gaming console. However, there are still several ways in which they don’t seem to be learning from history.

One of the greatest strengths consoles have always had (and, with PCs taking the lead on hardware innovation, their main strength) is their appliance-like simplicity. Conversely, the greatest weakness of personal computers is that they do an inherently complex job and attempts to reduce them to mere appliances have always crippled them to the point of irrelevance.

However, as the millennial generation grew up with computers, the definition of an acceptably simple console shifted closer and closer to what Steam now provides, growing a simplified operating system, game browser, online store, and a menu analogous to the Steam overlay.

While I’m not a fan of online DRM, I can’t help but approve of the money and effort Valve has been pouring into getting people to make Linux builds of games (making a build without Steamworks CEG for sites like GOG and Humble is easy once you get that far), so I thought I’d point out the main mistakes Valve seems not to be learning from the tales of other “licensed hardware” consoles, like the 3DO, the Philips CDi, and the Nuon.

The Problems

First, price. Without the ability to sell the console as a loss-leader or enjoy the massive economies of scale for a single model, these consoles always lose out on price.

Second, confusion. To put it bluntly, “If I wanted to do research, I’d be a PC gamer”. Picking the right Steam Machine is a serious issue and at odds with the “console-like simplicity” niche that everything else about the Steam Machine has been aiming for.

Valve should also keep in mind that a glut of choice without clear and obvious determinants was one of the big contributing factors to the video game crash of 1983. The difference is that, here, the answer is “Stick to Nintendo/Sony/Microsoft” rather than “Shy away from the entire market”.

Third, inertia. Whenever you look at technologies which failed to live up to their potential, one of the recurring themes is that they fell off the wave of excitement they were building and it died away. Steam Machines have suffered the same problem, which makes future marketing efforts much more difficult.

That said, the biggest problem the 3DO, CDi, and Nuon had was their poor game libraries… something Valve has been doing an excellent job of solving. This is why I firmly believe Steam Machines have a chance.

The Solutions

First and foremost, Valve needs to simplify buying decisions. I strongly suggest the following:

  1. Summarize system requirements into numbered hardware classes and focus promotional efforts on three at a time, representing entry, mid, and high-level hardware:
    • Class 1: Entry Level on release day
    • Class 2: Mid-level on release day, Entry Level when Class 4 is announced
    • Class 3: High-level on release day, Mid-level when Class 4 is announced
    • Class 4: High-level when announced
  2. Set up something like the Windows Logo Program through which hardware partners are approved for “Class 1/2/3” badges to use on their hardware and packaging.
  3. Use class badges to summarize the required and recommended system requirements on each game in the Steam store and add support for filtering by them.
  4. By default, Steam Machines should filter by required class. (Completely. Even front-page promotions which don’t run on the filtered class should be eliminated or collapsed into a “deals for your other devices” bar on Steam machines incapable of playing them.)
  5. Produce a prominent “What Steam Machine Is Right For Me?” comparison matrix based, not on system requirements or features, but on comparing which games will run on which of the three classes currently being promoted. I’d suggest the following three columns with “and more…” hyperlinked to an appropriate catalogue search:
    1. Class 2: <list of popular games> and more…
    2. Class 3: Everything in Class 2, plus <list of popular games> and more…
    3. Class 4: Everything in Classes 2 and 3, plus <list of popular games> and more…

If users start thinking of the classes in terms of “a Class 1 game” rather than “a Class 1 machine”, so much the better for marketing purposes. (It could be leveraged into a tool for evaluating current desktop PC hardware or possibly planned purchases for their suitability to gaming, thus helping Steam as a whole.)

Second, Valve needs to change the conversation about price. When you buy a modern console, everyone knows that you’ll need more than one because, at best, you’ll get one generation of backwards compatibility and you may have to re-buy your games for that.

When you buy a Class 5 Steam Machine, the Steam Runtime guarantees compatibility all the way back to Class 1 and, when you buy a Class 8 Steam Machine, you don’t have to re-buy anything. Furthermore, games are never locked to your console the way they are with the Wii Virtual Console.

Also, as everyone on PC knows, Steam sales allow you to build your library much more cheaply than on a regular console.

A Steam Machine is an investment in spending a lot less on the actual games.

Third, Valve needs to build on that “fewer pieces of hardware” angle. If a Steam Machine is maximally backwards compatible, why can’t they also partner with Google and/or Netflix to include a tweaked copy of Chrome and ChromeOS apps for YouTube and Netflix? I know for a fact that Netflix has one.

Failing that, they could pour some effort into Shashlik to get the YouTube and Netflix Android apps running on SteamOS in a polished way. Hell, if they’re not planning to compete with the Google Play Store, that’d allow them to add “plays select Android games” to the less emphasized portion of the Steam Machine feature list.

(“less emphasized” because it’d probably be tricky to get the go-ahead to preload the Google Play Store app and “select games” because, last I checked, most Android games used native ARM machine code and I’m unsure what Intel would want for their libhoudini ARM-to-x86 emulation layer for Android.)

Finally, Valve needs to manage their marketing better. Allowing the excitement surrounding the Steam Machine to bleed away into “valve time” [2] [3] will cripple any attempt to break into an existing market. Find someone who knows how to walk the hype tightrope and listen to their advice very closely.

However, on this front, Valve actually has an advantage that the 3DO, CDi, and Nuon didn’t: Steam is an established, successful brand, Steam is the leader in PC game sales, they have a lot of money, and they’ve already proven a capacity for long-term thinking with Steam itself… they have the resources to succeed where falling this far off the hype train was a death blow to 3DO and friends.

…or I could be wrong and Valve made a conscious decision to put Steam Machines on the back burner when the Windows Store failed to materialize as an active and growing threat. Either way, had Valve followed this advice in the early days, the Steam Machine concept would be in a much stronger position now.

Posted in Web Wandering & Opinion

A Compromise Between Substring and Prefix Matching

A.K.A.: How to write what human intuition actually expects substring matching to be

While the changes aren’t yet ready to be pushed, I’ve been working on one of my hobby projects quite a bit over the last few days and I just thought I’d share a little something I stumbled upon while implementing a result filter box.

Systems with advanced string searching will often let you choose between prefix or substring matching, but I’ve found that both have glaring flaws when you’re implementing something like a “find as you type” launcher, where the goal is a fast match that’s “good enough”.

With substring matching, you quickly realize that computers are much better than humans at finding substrings in the darndest of places, making substring matching very counter-intuitive. (I get the impression that it has to do with humans thinking in syllables while computers don’t, so it’d be interesting to see how the effect changes in writing systems whose characters represent whole morphemes or syllable blocks, like Kanji or Hangul.)

By contrast, prefix matching is often overly specific and ill-suited to situations where many titles may begin with the same article (A, The, etc.) or the name of a series with many entries. Unfortunately, splitting off the articles, then moving them to the end, as Steam does, also has the potential to trip people up, so there’s no perfect solution.

The solution I developed, almost by accident, is essentially a half-way point between prefix matching and the full-blown keyword-based approach a search engine takes:

Use case-insensitive matching and require that substring matches begin at a word boundary.

This has the following desirable characteristics for a find-as-you-type solution:

  • It minimizes the need to press modifier keys, which require costly muscle synchronization:
    • It’s case-insensitive
    • There’s no need for users to quote literals to avoid them being reordered as would be necessary with a full-blown keyword search grammar (ie. “pirates of” won’t match “of pirates”)
  • It’s robust against variations in title formatting:
    • A search for “bri” will match both “The Bridge” and “Bridge, The” without also returning spurious results like “Abrix the robot”.
    • A search for “pir” will return “Space Quest III: The Pirates of Pestulon” without concern for how many Space Quest games sort earlier in the results, whether the title was transcribed using “3”, “III”, or “]|[”, whether the subtitle begins with “The”, or whether the separator is “: ” or “ – ”.
  • It lacks the over-broadness that you find with substring matching, where “pir” will match “Drascula: The Vampire Strikes Back” and “Spirits”.

It’s also simple to implement:

  • For typical regexp searching, just prepend \b to the pattern and set the case-insensitive flag. (If your engine lacks \b, then use (^|\s) instead.)
  • For literal string matching on top of a regexp engine, just escape the pattern and follow my instructions for a regexp search.
  • For CMD.EXE-style wildcard matching, escape the pattern, then replace \? with . and \* with .* before prepending the \b.
  • For a manual implementation of literal-string matching on titles with normalized whitespace, just check whether it matches at the beginning (eg. title.lower().startswith(pattern.lower())) and then prepend a space and search within. (eg. title.lower().find(' ' + pattern.lower()) >= 0, since Python’s str.index raises ValueError on a miss rather than returning -1)
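The recipes in that list can be sketched in Python roughly as follows. The function names `boundary_regex` and `matches_manually` are hypothetical, and the wildcard handling assumes that `re.escape` turns `?` and `*` into `\?` and `\*`:

```python
import re


def boundary_regex(pattern, wildcards=False):
    """Compile a case-insensitive matcher anchored at a word boundary.

    With wildcards=True, CMD.EXE-style ? and * are translated after escaping.
    """
    escaped = re.escape(pattern)
    if wildcards:
        escaped = escaped.replace(r"\?", ".").replace(r"\*", ".*")
    return re.compile(r"\b" + escaped, re.IGNORECASE)


def matches_manually(title, pattern):
    """Literal matching without a regexp engine.

    Assumes whitespace in the title is already normalized to single spaces,
    so a word can only start at position 0 or right after a space.
    """
    title, pattern = title.lower(), pattern.lower()
    return title.startswith(pattern) or (" " + pattern) in title
```

For example, `boundary_regex("bri").search("The Bridge")` finds a match, while the same search against “Abrix the robot” returns None.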

UPDATE 2016-10-02: The \b word boundary token doesn’t consider parentheses to be part of a word, which I’ve found to be a confusing surprise in day-to-day use, so you’ll want to use (^|\b|\s) instead of \b. This will allow both “(Eng” and “Eng” to match “(English)” in typical usage for maximum intuitiveness.
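A small Python sketch of the difference described in that update (the helper name `search_title` and the sample title are made up for illustration):

```python
import re


def search_title(pattern, title, relaxed=True):
    """Match `pattern` at a word start, optionally using the relaxed prefix.

    relaxed=False uses plain \\b, which cannot match before a pattern that
    itself starts with a non-word character such as "(".
    """
    prefix = r"(^|\b|\s)" if relaxed else r"\b"
    return re.search(prefix + re.escape(pattern), title, re.IGNORECASE)


title = "Some Game (English)"
strict_hit = search_title("(Eng", title, relaxed=False)   # no match
relaxed_hit = search_title("(Eng", title)                 # matches via \s
plain_hit = search_title("Eng", title)                    # matches via \b
```

The relaxed prefix still rejects mid-word hits, so “bri” continues to miss “Abrix the robot”.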

In case you want to play around with this, here’s a quick sampling of how to regex-escape a string in various popular environments:

Posted in Geek Stuff

Using OpenCV to automatically skip recurring post-roll ads

TL;DR: Install OpenCV-Python, download this script and follow the instructions in the script’s --help output.

While I like The Young Turks, they’ve recently started adding the same two or three carnival barker-esque appeals for subscribers to the end of all of their videos. That gets very annoying very quickly.

Since I don’t believe in rewarding bad behaviour (like forcing avid viewers to see the same couple of annoying ads a million times), I refuse to let them nag me into being a member. However, I still need something to occupy my mind while doing boring tasks, so I needed a solution.

As such, here’s a Python OpenCV script which will find the time offset of the last occurrence of a frame (eg. a screenshot of the TYT title card that appears between the content and the ad) in a video file and then write an MPV EDL (Edit Decision List) file which will play only the portion of the video prior to the matched frame.
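The script itself isn’t reproduced here, but the general idea can be sketched as below. This is an independent illustration rather than the linked script: the function names, the grayscale mean-difference comparison, and the threshold of 10.0 are all assumptions, and `cv2` is imported lazily so the EDL writer can be used without OpenCV installed.

```python
import os


def find_last_match(video_path, screenshot_path, threshold=10.0):
    """Return the timestamp (seconds) of the last frame resembling the screenshot.

    Assumed comparison: mean absolute grayscale difference below `threshold`.
    Returns None if no frame matches.
    """
    import cv2  # requires the OpenCV-Python package

    target = cv2.imread(screenshot_path, cv2.IMREAD_GRAYSCALE)
    if target is None:
        raise ValueError(f"could not read {screenshot_path!r}")
    capture = cv2.VideoCapture(video_path)
    last_match = None
    try:
        while True:
            position_ms = capture.get(cv2.CAP_PROP_POS_MSEC)
            ok, frame = capture.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            gray = cv2.resize(gray, (target.shape[1], target.shape[0]))
            if cv2.absdiff(gray, target).mean() < threshold:
                last_match = position_ms / 1000.0
    finally:
        capture.release()
    return last_match


def write_edl(edl_path, video_path, cut_seconds):
    """Write an mpv EDL that plays `video_path` from 0 up to `cut_seconds`."""
    name = os.path.basename(video_path)
    # %N% length-prefix quoting lets names contain commas; N counts bytes.
    quoted = f"%{len(name.encode('utf-8'))}%{name}"
    with open(edl_path, "w", encoding="utf-8") as handle:
        handle.write("# mpv EDL v0\n")
        handle.write(f"{quoted},0,{cut_seconds:.3f}\n")
```

Because only the file’s base name is written, this sketch assumes the EDL is saved alongside the video.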

UPDATE: Hint: Put this script, your videos, and one or more screenshots (to be matched in a fallback chain, sorted by name) into the same folder and you can just double-click it.

I’ve also done the preliminary research to fuzzy-match the audio of those two or three naggy bits in case they decide to try to render this ineffective by moving the title card to the very end… partly because it would also give a more accurate cut point if used with the current clips.

(As is, I tend to lose the last 500 to 1500 milliseconds of actual content due to variations in how they cut the pieces of each clip together… but, even if I lost an entire clip every now and then, it’d be an acceptable sacrifice to avoid those annoying nags. Current clips are cut together such that stopping at the last frame of the end-title card removes the nag perfectly.)

Posted in Geek Stuff

“Gypsy Bard” and the My Little Pony Fandom’s Creative Output

As I’ve mentioned before, I’ve a certain fondness for throwing characters into interesting situations to see what makes them tick and, as I’ve also mentioned before, I got lured into the My Little Pony fandom by the wide selection of catchy fan-created music. I’ve decided I want to comment further on that.

While waiting for some files to transfer, I found myself reading a well-written self-insert fanfic (Damn you, recommendations sidebar! You always know just what’ll hook me next! ;P ) with an interesting plot point:

When our hero discovers that this isn’t just any My Little Pony setting, but, rather, that she’s “stuck in a snow globe” made of her own hypothetical musings, she tries to drink away the resulting burst of existential despair and winds up singing a song. There’s more to it, but I want to focus on the song:

Gypsy Bard Cover by Dreamchan feat. Princewhateverer

Ignoring the benefits of this kind of cross-referencing being common, let’s look at the context in which this song exists.

In addition to Dreamchan’s cover, this same song also has a remix by The Living Tombstone (with impressive visual accompaniment by olibacon), a cover with piano accompaniment by Flutterwhat, an orchestral-styled instrumental version by BassBeastDJ, and various other ones that don’t stand out from the crowd as clearly like this 8-bit cover.

Now, this isn’t unheard of. In fact, having a large constellation of covers, remixes, and the like, seems to be becoming more common. For example, see this cover and many others for the song Megalovania from Toby Fox’s Undertale or the many “Abridged Series” comedy dub edits. (Here’s an example clip with non-worksafe language taken from one of them.)

It’s not even unusual for a fandom to produce fanworks that go beyond ordinary covering and/or remixing. For example, here are some songs from the fans of Starcraft, Deus Ex, Portal, Star Wars Galaxies, and Mass Effect 2.

What makes MLP:FiM noteworthy is how much the fandom has been producing fanworks that go beyond merely remixing existing content. For example, the Deus Ex and Mass Effect 2 songs I linked were both by the same artist, Miracle of Sound, while the MLP fandom alone has artists like ponyphonic, WoodenToaster, and StormWolf, each having produced multiple songs.

The fans were even working on a fighting game, called Fighting is Magic, before Hasbro decided that went too far and sent them a Cease and Desist notice.

What makes Gypsy Bard so special is that it’s an original song written for episode 7 of an “Abridged Series” (I quote it because the creators consider it too divergent for that term to fit well) called “Friendship is Witchcraft”.

Why is that so important? Because of what it implies about the size and vitality of the fan community. Not only did they get together enough people to voice-act an Abridged Series, they also managed to find someone to write original songs. Then, on top of that, one of those songs inspired its own covers and remixes… one of which (Dreamchan’s) would have been based on Flutterwhat’s piano version if Princewhateverer hadn’t contributed a guitar version. In other words, it’s a third-order fan-work at minimum. (A fan-work of a fan-work of a fan-work… not counting indirect inspirations) Those are not common. In fact, until My Little Pony, the most I’d ever seen was a second order fan-work.

However, to be fair, with music, the main barrier to high-order fanworks is getting attention. What I originally saw were second-order fanfics, which require readers to be familiar with the original work first. (For example, a Harry Potter fanfic named Make a Wish was popular enough for other people to write fanfics in the universe it established.)

A closer musical analogue would be The Mirror Lies, a heavy metal song written as a tie-in to a My Little Pony fanfic named A Change of Face.

Nonetheless, I still maintain that the fandom for My Little Pony: Friendship is Magic has attracted an uncommon mix of quality and quantity when it comes to creative and skilled fans. Let’s just hope that it’s merely ahead of the curve and this kind of output will become increasingly common.

Posted in Web Wandering & Opinion

Chickasaw Mountain by Leslie Fish

I’d like to take a few minutes to wax poetic on a song I just recently discovered:

Chickasaw Mountain by Leslie Fish (lyrics, buy)

There are so many reasons I love the song. However, I’ll focus on the lyrics since I’m not very good at explaining why I love music aside from contributing factors like “it sounds Celtic” and “it incorporates violins”.

I’ll start by focusing on the most obvious layer of the lyrics:

It’s a folk ballad, where the singer tells of the Faustian bargain a friend made with a being known as the Lady of the Morning Star. I love the impression this layer of meaning gives.

At first, it makes various references to deals with The Devil, with lines like Call Her Lady of the Morning Star and making it his sister’s apple tree that he winds up hanging on.

…but, at the same time, it makes it very clear that this is NOT Lucifer, with the lines Name your goal; She won’t ask your soul and this passage:

Seek no level of God or Devil
She’s something older by far
Call Her Lady of the Morning Star

The overall impression is that The Lady of the Morning Star is some kind of primordial Fae-like being: An immortal with incomprehensible motivations, who views mortals as toys for her amusement and is so ancient that, if Lucifer does exist in this setting, he was likely named in reference to her.

The Fae feeling is further reinforced by the last line Any wild place on Earth will do!

(It’s a common theme that the Fae and other such creatures shy away from civilization and dwell in “the wild and untamed places of the world”.)

On a purely emotional level, this layer of meaning is all I care about and I can’t get over how much I love it.

However, on an intellectual level, there’s still more to come. Let’s move on to the second layer of meaning:

According to the subtitle, the song is a tribute to Phil Ochs.

Seen in this light, a second complete set of references emerges.

For example, these lines:

She offers two bargains; the price is deep and dark
One takes your life and the other leaves a mark

…and these lines…

Whoever has wisdom can guess what lies unsaid
The cost of the gift to the living and the dead
Still if you feel you’ll gain from the deal
You’ll play with the old Morning Star

If that’s not a metaphor for the “live fast and die or burn out young” pattern that takes so many great artists, I don’t know what is.

The rest of the song follows Mr. Ochs’s rise and fall closely, with phrases like these:

Made him the best of his generation
Sang till the end of the war
And not a moment more.

…which reference his status as one of the biggest names among Vietnam War protest singers and his subsequent descent into mental illness immediately thereafter… finally ending with the lyric Hanging on his sister’s apple tree, a reference to how, less than a month before the one-year anniversary of the end of the Vietnam War, Phil Ochs committed suicide by hanging while living with his sister.

Posted in Web Wandering & Opinion

Working around Pidgin’s mis-designed certificate error dialog

The Pidgin developers apparently haven’t thought things through very well when it comes to TLS/SSL support because, if you want to connect to a network which uses a self-signed cert, they’ll present you with a permission dialog every time you connect (no “remember” option) and, last time I reported this, they considered it “WONTFIX: self-signed certs are a bug”.

They seem to think that it will force network operators to get proper certificates but, in reality, they don’t have that kind of leverage when every other IRC client allows you to ignore cert errors, so it just forces people to either turn off SSL or desensitizes them to the warning dialogs.

This is especially problematic for me because one of the networks I connect to is encrypted-only, managed by someone who doesn’t trust Let’s Encrypt and, if the Pidgin SSL handshake times out, it remembers the failure rather than re-displaying the prompt on reconnect. Because I have no idea how long the timeout is but it always seems too short, that trained me to punch “allow” on the annoying dialog as quickly as possible without wasting time reading the prompt… never a good sign.

So, today, I’m going to teach you how to get the best of both worlds: How to use stunnel to trick Pidgin into using self-signed SSL without complaining. (And, as a bonus, stunnel makes it easy to verify a self-signed cert without adding it to the system-wide cert store, so it can actually be more secure than the certs the Pidgin developers want you to use.)

First, install stunnel and set it to run on startup. On Debian-based distros like Ubuntu and Mint, this is as simple as running sudo apt-get install stunnel and then setting ENABLED=1 in /etc/default/stunnel4.

Next, we need to write a config file so that, when your client connects to a stunnel server on localhost without encryption, it will make an encrypted connection to the IRC server in question.

The key lines are as follows:

; Maximize security
; NOTE: See the manpage or sample config file for implications
chroot = /var/lib/stunnel4/
setuid = stunnel4
setgid = stunnel4

; Needed for stunnel to work properly
; (Prefix the contents of the chroot line if not using chroot)
pid = /stunnel4.pid

; Disable support for insecure SSLv2 protocol
options = NO_SSLv2

; Define the actual proxy service
client = yes
accept =
connect = irc.your-network.com:6697

Now, to match what you got from putting up with Pidgin’s security dialog:

  1. Put those lines into /etc/stunnel/whatever_you_want.conf
  2. Start stunnel (sudo /etc/init.d/stunnel4 start if you’re on Ubuntu 14.04 LTS or older)
  3. Set Pidgin to connect to the address in the accept line.

Pidgin will think it’s connecting to an un-encrypted IRC server and the connection will be encrypted between stunnel and the server.

However, we can do one better. If we can get the server certificate in PEM format, we can have stunnel verify it, preventing man-in-the-middle attacks.

The ideal solution would be to download the PEM file through a trusted channel but, as a stop-gap, let’s replicate the “trust whatever we see first” behaviour that SSH uses. Fill in your IRC server’s details in the following command and run it to dump the server cert:

openssl s_client -showcerts -connect irc.your-network.com:6697 </dev/null 2>/dev/null|openssl x509 -outform PEM >your-irc-network.pem

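Now, copy the resulting your-irc-network.pem file into /etc/stunnel/. Note that CAfile only tells stunnel which certificate to trust; if your config doesn’t already turn verification on (the verify option, or verifyChain/verifyPeer on newer stunnel releases), check the manpage for the setting that matches your version. Then add the following lines to your whatever_you_want.conf file: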

CAfile = /etc/stunnel/your-irc-network.pem

You’ll probably also want to add these lines temporarily so you can see what’s going wrong if the verification fails:

debug = 7
output = /stunnel.log

…and, that’s it. Just restart stunnel, reconnect with your IRC client, and, barring verification errors, you should have an encrypted connection which verifies the self-signed certificate.

(I say “should” because, as of this writing, the self-signed cert I’m testing against is expired, so I can’t get all the way through the verification process to confirm.)

Posted in Geek Stuff