A more formal way to think about validity of input data

I’ve begun to port one of my hobby projects from Python to Rust and, while setting up the clap argument parser, I found myself having to bind to the access(2) libc function myself.

Yes, it exposes you to a race condition exploit if you’re not careful, because the permissions could change between checking and depending on them. Yes, it’s a documented fact that it may be more permissive than actually attempting to access the filesystem. (I believe the situation I’m remembering was “access() doesn’t consider ACLs when evaluating permissions”) …but how else am I to implement a “fail early” check for “Can I create files in this directory?” when there exist real in-the-wild examples of filesystems (eg. AFS) having been configured to allow the creation of a hypothetical test file, but not the subsequent deletion?

That said, despite my intent to use Rust to ensure I handle every recoverable error case, there’s still a certain appeal to being able to point to a spot and say “beyond this point, this piece of data is trustworthy”.

Thinking about this made me realize that there's a nice, simple way to think about handling input data: by analogy to passing by value (with deep copying) or by reference.

NOTE: While my examples will all use command-line arguments, this applies to any kind of input data.

Value Arguments

If a command-line argument cannot become invalid after being validated, then it’s a value argument. Examples of this include:

  • Boolean flags like “mirror this print job”
  • Integers representing things like the number of copies of a document to print
  • Strings which can’t experience any kind of namespace collision

You can validate value arguments once and then trust that they’ll stay valid.
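
As a rough illustration, here is a minimal Python sketch of validating value arguments once, at the boundary, and then carrying them around as an immutable value. The `--copies` flag, the `PrintOptions` class, and the helper names are hypothetical and only there for the example; `--mirror-print-job` is borrowed from the examples further down.

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class PrintOptions:
    """Value arguments: validated once at the boundary, then trusted."""
    copies: int
    mirror: bool


def positive_int(text):
    """argparse type callback that only accepts integers >= 1."""
    try:
        value = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{text!r} is not an integer")
    if value < 1:
        raise argparse.ArgumentTypeError(f"{text!r} is not a positive integer")
    return value


def parse_options(argv):
    """Parse and validate; nothing outside the program can invalidate the result."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--copies", type=positive_int, default=1)
    parser.add_argument("--mirror-print-job", action="store_true")
    args = parser.parse_args(argv)
    return PrintOptions(copies=args.copies, mirror=args.mirror_print_job)
```

Everything downstream of `parse_options` can take a `PrintOptions` and skip re-checking it, which is exactly the "beyond this point, this piece of data is trustworthy" property.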

Reference Arguments

If an argument depends on something outside your control to determine its validity, then a validity check only applies to the instant you perform it. Common examples of “reference arguments” include:

  • Filesystem paths (Between the check and use, permissions could change, a creation/deletion/rename could invalidate the path, etc.)
  • File descriptors (Even a supposedly local file descriptor could be on a network-mounted drive which goes away)
  • Strings used to create filenames (someone could create a file with that name which you lack the permissions to manipulate)
  • Network addresses
  • Cached results of arbitrary checks

This means that you need to be prepared for the unexpected every time you use a reference argument, and you can only check one separately from using it if all of the following conditions are met:

  1. The check has no security implications and can be safely removed
  2. You accept that the check could fail but the attempt could still succeed
  3. You accept that the check could succeed but the attempt could still fail
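
One way to picture those three conditions is the following Python sketch, which uses `os.access` (Python's wrapper around access(2)) purely as an advisory check. The function names are made up for illustration, and this is only one possible arrangement, not a recommendation for any particular program.

```python
import os
import sys


def warn_if_unwritable(directory):
    """Advisory early check. Deleting every call to this must stay safe."""
    # Creating a file in a directory needs write and search (execute) permission.
    if os.access(directory, os.W_OK | os.X_OK):
        return True
    print(f"warning: {directory!r} does not look writable", file=sys.stderr)
    return False


def write_output(directory, name, data):
    """The authoritative attempt. It handles failure no matter what the check said."""
    path = os.path.join(directory, name)
    try:
        with open(path, "wb") as handle:
            handle.write(data)
    except OSError as err:
        print(f"error: could not write {path!r}: {err}", file=sys.stderr)
        return None
    return path
```

Note that `write_output` never consults the result of `warn_if_unwritable`. The check exists only to tell the user about an obvious mistake before any long-running work starts.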

Examples

| Argument | Type | Why? |
| --- | --- | --- |
| --mirror-print-job | Boolean Value | Nothing external to the program will invalidate this. (The only way this could be a reference is if there were some kind of wrapper which detected the orientation of pre-punched cardstock in the printer and then did or didn’t pass this flag. The user could invalidate it by flipping/rotating the card stock before the print job actually begins.) |
| --erase-disc | Boolean Reference | The flag implies that either the user or the code detected a rewritable CD/DVD, but the user could swap in a non-rewritable disc before it actually gets used if the script does something long-running first, like generating an ISO in /tmp. Because you can only erase a rewritable disc, this must be validated as late as possible. (ie. After the drive tray has been locked and right before the operation would take place) |
| Number of copies to print | Integer Value | The only relevant detail which can change is how much paper is in the printer, and, if there isn’t enough, the proper solution isn’t to reduce the size of the print job. |
| File descriptor | Integer Reference | The descriptor could be pointing at a resource on a network-attached device that goes away. |
| Document Title | String Either | Whether to treat this as a reference depends on where it will end up and how you handle failure. If you’re converting an eBook with ebook-convert from Calibre, then it’s a value because the output filename is specified separately and whether your title will override the source file’s metadata is not up for debate. |
| Output Filename | String Reference | No matter how many times you validate, it’s possible that a read-only file will have taken that name by the time you call open() |
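
To make the "Output Filename" row concrete, here is a small Python sketch where a point-in-time check passes and the later open() still has to cope with the name having been taken. Both function names are hypothetical, and using exclusive-create mode (`"x"`) is my choice for the example rather than the only way to do it.

```python
import os


def name_looks_free(path):
    """Point-in-time check: True only means the name was free a moment ago."""
    return not os.path.lexists(path)


def open_output(path):
    """Open a new file for writing; return (handle, None) or (None, reason)."""
    try:
        # "x" asks the OS to fail rather than reuse an existing name, so the
        # real decision is made at the moment of use, not at validation time.
        return open(path, "x"), None
    except FileExistsError:
        return None, "name was taken after validation"
    except OSError as err:
        return None, f"could not open: {err}"
```

However many times `name_looks_free` returns True, only the result of `open_output` says whether the file is actually yours to write.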

The Takeaway

  • Think in terms of how one piece of data depends on another and don’t forget that dependencies can extend outside of your program.
  • Whether a piece of data can be validated once and then trusted is unrelated to its data type or how it’s passed within your code. (You can pass a filename or URL by value but it’s still a reference to an external resource. A network filesystem will subvert your expectations for how reliable it is to hold an open file descriptor. etc. etc. etc.)
  • The definition of “valid” for a piece of data may depend on how your program is intended to be used. (A human might specify a filename and re-run your tool if it’s already taken. From your perspective, that means it’s valid even if it causes the process to abort. A GUI frontend, on the other hand, probably won’t know how to detect that kind of failure and retry. Expose a more foolproof API by using something like mkstemp or mkdtemp and then returning the newly-created path.)
  • Functions like access which check the validity of a reference are unreliable and should only be used to catch obvious mistakes early so the user doesn’t have to waste their time waiting for a failure that could have been anticipated. If it’s unsafe to comment them out, you’re doing it wrong.
    (eg. You can use access to detect read-only target directories before you know the exact output filename… with the caveat that they could be made read-only between the check and the attempt to actually write the file.)
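
For the "more foolproof API" idea, a minimal Python sketch might look like the following. It relies on `tempfile.mkstemp`, which creates the file and picks its name in a single step, so there is no gap between choosing a name and claiming it. The function name and the default prefix/suffix are assumptions made for the example.

```python
import os
import tempfile


def save_to_new_file(directory, data, prefix="output-", suffix=".bin"):
    """Create a new, uniquely named file in `directory` and return its path."""
    # mkstemp returns an already-open descriptor for a file that did not exist
    # before the call, so the caller never has to guess a free name.
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=suffix, dir=directory)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path
```

A frontend calling this doesn't need to detect "name already taken" and retry. It just uses whatever path comes back, though it still has to handle an `OSError` if the directory itself turns out to be unusable.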

CC BY-SA 4.0 A more formal way to think about validity of input data by Stephan Sokolow is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
