Unicode in filenames

Thu Nov 15 09:54:38 EST 2007

Joyride-277 doesn't validate, because it contains a file from the
library whose filename is in non-normalized Unicode.  The file is named
'Annobo?n_Bioko-thumb.jpg', where the ? should be a separate combining
accent following the o, but the filename is actually stored with a
single precomposed 'o+accent' character.
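
To make the difference concrete, here is a minimal Python sketch of
the two spellings.  It assumes the accent is an acute (U+00F3 when
precomposed, 'o' followed by U+0301 when decomposed); the variable
names are only for illustration.

```python
import unicodedata

# Precomposed spelling: o-with-acute is a single code point (U+00F3).
composed = "Annob\u00f3n_Bioko-thumb.jpg"

# Decomposed spelling (Normalization Form D): 'o' followed by U+0301.
decomposed = unicodedata.normalize("NFD", composed)

# The two names render identically but are different strings...
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 23 24

# ...and therefore different byte sequences once encoded as UTF-8.
print(composed.encode("utf-8"))
print(decomposed.encode("utf-8"))
```

A byte-for-byte comparison of the two names fails even though a
directory listing shows them as the same text.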

Now, at first blush this is a bug in the (fast) contents verifier,
which I will fix: all strings should be Unicode-normalized before they
are compared.  But it seems like this raises issues with (for example)
URLs to library content.  Should we enforce the constraint that all
filenames are Unicode-normalized on disk, so that we can guarantee
that a (Unicode-normalized) URL will always resolve correctly?
Otherwise we run the risk of someone editing a file and resaving it
with a name which *appears* identical, but is actually encoded
differently on disk, and having URLs to the file mysteriously break.
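
As a rough sketch of the "normalize before comparing" idea, and of
how a URL lookup could tolerate either spelling, something like the
following might do.  The helper names (nfd, names_match, resolve) are
hypothetical and are not taken from the actual verifier; the sketch
also assumes URLs are percent-encoded UTF-8.

```python
import unicodedata
from urllib.parse import unquote


def nfd(text):
    """Return text in Unicode Normalization Form D."""
    return unicodedata.normalize("NFD", text)


def names_match(a, b):
    """Compare two names after normalizing both to NFD."""
    return nfd(a) == nfd(b)


def resolve(url_path, names_on_disk):
    """Find the on-disk name a percent-encoded URL path refers to.

    Both sides are normalized, so the lookup succeeds whichever
    spelling the URL or the disk happens to use.  Returns None when
    nothing matches.
    """
    wanted = nfd(unquote(url_path))
    for name in names_on_disk:
        if nfd(name) == wanted:
            return name
    return None
```

If filenames were guaranteed to be normalized on disk, resolve()
would only need to normalize the URL side and could then do an exact
lookup, which is the point of the constraint.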

For the technically-minded, we're talking about using the UTF-8
encoding of Unicode Normalization Form D, as discussed (briefly) at
http://wiki.laptop.org/go/Canonical_JSON.  The problem has arisen
because the old libraries used normalized filenames, but we've
switched to installing the libraries from RPMs, and apparently
non-normalized filenames have snuck in.  If I were to hazard a guess,
I'd say that the tar command normalizes filenames as they are
archived, while RPM does not.

My proposal is to ensure that all filenames in the base system (at
least) are in Normalization Form D. I will add a checker to the
build process to enforce this, and we should probably eventually write
checkers for the activity/library bundle tools that will do the same.
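
A checker along these lines could be quite small.  What follows is
only a sketch under my own assumptions, not the eventual build-process
code: the function names are made up, and it assumes the filesystem
hands names back to Python as decodable text.

```python
import os
import unicodedata


def is_nfd(name):
    """True if name is already in Unicode Normalization Form D."""
    return unicodedata.normalize("NFD", name) == name


def non_nfd_names(root):
    """Walk root and return paths whose final component is not NFD.

    Directory names are checked as well as file names, since either
    can end up in a URL.
    """
    offenders = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_nfd(name):
                offenders.append(os.path.join(dirpath, name))
    return sorted(offenders)
```

A build step could call non_nfd_names() on the image root, print
whatever comes back, and fail the build if the list is non-empty.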
 --scott

-- 
                         ( http://cscott.net/ )