Stripping Emoji from File And Folder Names

There’s an annoying little problem that crops up sometimes when you save stuff off the web, and the popularity of emoji has brought it into the spotlight: some tools still assume Unicode code points will fit within 16 bits, and they break on characters outside the Basic Multilingual Plane (BMP for short).
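To make the 16-bit assumption concrete, here is a small illustrative snippet (the helper name `utf16_units` is made up for this example). It shows that BMP characters fit in a single UTF-16 code unit, while an emoji such as U+1F600 needs two (a surrogate pair), which is what trips up tools that expect one unit per character.

```python
def utf16_units(ch: str) -> int:
    """Return how many 16-bit code units one character needs in UTF-16."""
    # utf-16-le emits no byte-order mark, so the byte count is exactly 2 per unit.
    return len(ch.encode("utf-16-le")) // 2


# "A" and "é" live in the BMP; U+1F600 (an emoji) is an astral code point.
for ch in ("A", "\u00e9", "\U0001F600"):
    print(f"U+{ord(ch):04X}  above_0xFFFF={ord(ch) > 0xFFFF}  utf16_units={utf16_units(ch)}")
```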

I’ve seen this happen with git gui, where I had to adjust the test suites for some Unicode-processing code to use character escapes instead, but in this case the problem is astral characters (code points above U+FFFF) in filenames.

If you’ve ever used something like Ctrl+S on a Tumblr page or youtube-dl on a video that uses emoji in the title, you might have discovered that CD/DVD-burning GUIs like K3b run into errors with mkisofs/genisoimage when you try to save such files onto a typical Joliet+RockRidge ISO.

It used to be just one or two, so I’d rename them manually within K3b before burning the disc, but now I’m starting to see a lot of them, so I wrote a quick little script that recurses through one or more folders and renames away any code points above 0xFFFF. (I forgot the usual “Is this a file? Skip os.walk and go straight to the file handler” code, so it only accepts folders as arguments, not paths to individual files.)
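The script itself is not reproduced in this text, so what follows is only a minimal sketch of the behaviour described above, not the original code. The function names, the `_` fallback for names that consist solely of emoji, and the "skip on collision" policy are all assumptions made for illustration. Like the script described above, it only handles folders given on the command line.

```python
import os
import re
import sys

# Matches any code point above 0xFFFF (i.e. outside the BMP).
ASTRAL = re.compile("[\U00010000-\U0010FFFF]")


def strip_astral(name: str) -> str:
    """Return `name` with every code point above 0xFFFF removed."""
    return ASTRAL.sub("", name)


def rename_tree(root: str) -> list:
    """Rename files and folders under `root`; return (old, new) path pairs."""
    renamed = []
    # Walk bottom-up so a folder's contents are handled before the folder
    # itself is renamed; otherwise the paths below it would go stale.
    for parent, dirs, files in os.walk(root, topdown=False):
        for name in dirs + files:
            # Assumption: fall back to "_" if nothing is left after stripping.
            new_name = strip_astral(name) or "_"
            if new_name == name:
                continue
            src = os.path.join(parent, name)
            dst = os.path.join(parent, new_name)
            if os.path.lexists(dst):
                # Assumption: never overwrite; leave the entry for manual cleanup.
                print(f"Skipping {src!a}: {dst!a} already exists", file=sys.stderr)
                continue
            os.rename(src, dst)
            renamed.append((src, dst))
    return renamed


if __name__ == "__main__":
    for folder in sys.argv[1:]:
        for old, new in rename_tree(folder):
            # !a escapes non-ASCII so printing works on any terminal encoding.
            print(f"{old!a} -> {new!a}")
```

Saved under a name of your choosing, it would be run as `python3 your_script.py FOLDER [FOLDER ...]`. Note that stripping is lossy: a name like `Video 😀.mp4` becomes `Video .mp4`, with the space left in place.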

I’m not sure what Windows tools might break on non-BMP code points in filenames, but I habitually steer clear of anything I know will complicate making scripts portable, so it should work anywhere you’ve got Python 3 installed.

Enjoy.

CC BY-SA 4.0 Stripping Emoji from File And Folder Names by Stephan Sokolow is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
