Unicode in filenames

Dan Williams dcbw at redhat.com
Thu Nov 15 14:15:57 EST 2007


On Thu, 2007-11-15 at 09:54 -0500, C. Scott Ananian wrote:
> Joyride-277 doesn't validate, because it contains a file from the
> library with a filename in non-normalized Unicode.  The file is named
> 'Annobo?n_Bioko-thumb.jpg', where the ? should be a separate combining
> accent following the o, but the name is actually stored on disk with a
> single precomposed 'o+accent' character.
> 
> Now, at first blush this is a bug in the (fast) contents verifier,
> which I will fix: all strings should be unicode-normalized before they
> are compared.  But it seems like this raises issues with (for example)
> URLs to library content.  Should we enforce the constraint that all
> filenames are unicode-normalized on disk, so that we can guarantee
> that a (unicode-normalized) URL will always resolve correctly?

Everything on disk should be UTF-8.  Anything that's not UTF-8 is not
guaranteed to work.  Filenames need to be converted to UTF-8 before
the file is opened/created/renamed/etc.

Dan
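
For illustration only (this sketch is not from the thread): a minimal
Python example of the two ideas above, comparing names after
normalization and producing UTF-8 bytes for use on disk.  The helper
names nfd_utf8 and same_name are hypothetical, and the accent is assumed
to be an acute (U+0301), since the archive shows only a '?'.

```python
import unicodedata


def nfd_utf8(name: str) -> bytes:
    """Return the UTF-8 encoding of the NFD-normalized form of name."""
    return unicodedata.normalize("NFD", name).encode("utf-8")


def same_name(a: str, b: str) -> bool:
    """Compare two filenames after normalizing both to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Assumption: the accent is an acute. One spelling uses the precomposed
# character U+00F3, the other uses 'o' followed by combining U+0301.
composed = "Annob\u00f3n_Bioko-thumb.jpg"
decomposed = "Annobo\u0301n_Bioko-thumb.jpg"

print(composed == decomposed)                            # False
print(same_name(composed, decomposed))                   # True
print(nfd_utf8(composed) == decomposed.encode("utf-8"))  # True
```

Both spellings encode to valid UTF-8, but to different byte sequences,
so a byte-wise comparison of the two names fails unless both sides are
normalized first.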

> Otherwise we run the risk of someone editing a file and resaving it
> with a name which *appears* identical, but is actually encoded
> differently on disk, and having URLs to the file mysteriously break.
> 
> For the technically-minded, we're talking about using the UTF-8
> encoding of Unicode Normalization Form D, as discussed (briefly) at
> http://wiki.laptop.org/go/Canonical_JSON.  The problem has arisen
> because the old libraries used normalized filenames, but we've
> switched to installing the libraries from RPMs, and apparently
> non-normalized filenames have snuck in.  If I were to hazard a guess,
> I'd say that the tar command normalizes filenames as they are
> archived, while RPM does not.
> 
> My proposal is to ensure that all filenames in the base system (at
> least) are in normalization form D. I will write a checker in the
> build process to ensure this, and we should probably eventually write
> checkers for the activity/library bundle tools that will do the same.
>  --scott
> 
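
A rough sketch of what such a build-time checker might look like, in
Python.  This is an assumption about its shape, not the checker that was
actually written; the names is_nfd_utf8 and non_nfd_names are
hypothetical.  It treats a name as acceptable only if it decodes as
UTF-8 and is unchanged by NFD normalization.

```python
import os
import unicodedata
from typing import List


def is_nfd_utf8(raw: bytes) -> bool:
    """True if raw is valid UTF-8 and already in normalization form D."""
    try:
        name = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return unicodedata.normalize("NFD", name) == name


def non_nfd_names(root: str) -> List[str]:
    """Return paths under root whose final component fails is_nfd_utf8."""
    bad = []
    # Passing a bytes path makes os.walk yield the raw on-disk bytes,
    # so nothing is decoded (or silently altered) before the check.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for raw in dirnames + filenames:
            if not is_nfd_utf8(raw):
                full = os.path.join(dirpath, raw)
                # Decode leniently, for reporting purposes only.
                bad.append(full.decode("utf-8", "replace"))
    return bad
```

A build script could call non_nfd_names() on the image root and fail the
build if the returned list is non-empty.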



