Stripping Emoji from File And Folder Names

There’s an annoying little problem that happens sometimes when you save stuff off the web, which the popularity of emoji has brought into the spotlight: Some tools still assume Unicode code points will fit within 16 bits and break with characters outside the Basic Multilingual Plane (BMP for short).

I’ve seen this happen with git gui where I had to adjust the test suites for some Unicode-processing code to use character escapes instead but, in this case, the problem is astral characters in filenames.

If you’ve ever used something like Ctrl+S on a Tumblr page or youtube-dl on a video that uses emoji in the title, you might have discovered that CD/DVD-burning GUIs like K3b run into errors with mkisofs/genisoimage when you try to save such files onto a typical Joliet+RockRidge ISO.

It used to be just one or two, so I’d rename them away manually within K3b before burning the disc but, now, I’m starting to see a lot of them… so I wrote a quick little script that recurses through one or more folders (I forgot the usual “Is this a file? Skip os.walk and go straight to the file handler” code, so no file paths) and renames away any codepoints above 0xFFFF.

I’m not sure what Windows tools might break on non-BMP codepoints in filenames, but I habitually steer clear of anything I know will complicate making scripts portable so it should work anywhere you’ve got Python 3 installed.

Enjoy.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Strip emoji and other non-BMP codepoints from paths to make them compatible
with mkisofs/genisoimage"""
# Prevent Python 2.x PyLint from complaining if run on this
from __future__ import (absolute_import, division, print_function,
with_statement, unicode_literals)
__author__ = "Stephan Sokolow (deitarion/SSokolow)"
__appname__ = "strip_emoji.py"
__version__ = "0.1"
__license__ = "MIT"
import logging, os, re
log = logging.getLogger(__name__)
NON_BMP_RE = re.compile(r"[\U00010000-\U0010FFFF]")
def process_path(path):
replaced = NON_BMP_RE.sub('', path)
if replaced != path:
os.rename(path, replaced)
return replaced
def process_arg(path):
for path, dirs, files in os.walk(path):
dirs.sort()
dnew = [process_path(os.path.join(path, x)) for x in dirs]
dirs[:] = [os.path.basename(x) for x in dnew]
for fname in files:
process_path(os.path.join(path, fname))
def main():
"""The main entry point, compatible with setuptools entry points."""
from argparse import ArgumentParser, RawDescriptionHelpFormatter
parser = ArgumentParser(formatter_class=RawDescriptionHelpFormatter,
description=__doc__.replace('\r\n', '\n').split('\n--snip--\n')[0])
parser.add_argument('--version', action='version',
version="%%(prog)s v%s" % __version__)
parser.add_argument('-v', '--verbose', action="count",
default=2, help="Increase the verbosity. Use twice for extra effect.")
parser.add_argument('-q', '--quiet', action="count",
default=0, help="Decrease the verbosity. Use twice for extra effect.")
parser.add_argument('path', action="store", nargs="+",
help="Path to operate on")
# Reminder: %(default)s can be used in help strings.
args = parser.parse_args()
# Set up clean logging to stderr
log_levels = [logging.CRITICAL, logging.ERROR, logging.WARNING,
logging.INFO, logging.DEBUG]
args.verbose = min(args.verbose - args.quiet, len(log_levels) - 1)
args.verbose = max(args.verbose, 0)
logging.basicConfig(level=log_levels[args.verbose],
format='%(levelname)s: %(message)s')
for path in args.path:
process_arg(path)
if __name__ == '__main__': # pragma: nocover
main()
# vim: set sw=4 sts=4 expandtab :

CC BY-SA 4.0 Stripping Emoji from File And Folder Names by Stephan Sokolow is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

This entry was posted in Geek Stuff. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

By submitting a comment here you grant this site a perpetual license to reproduce your words and name/web site in attribution under the same terms as the associated post.

All comments are moderated. If your comment is generic enough to apply to any post, it will be assumed to be spam. Borderline comments will have their URL field erased before being approved.