Unicode in filenames

Bert Freudenberg bert at freudenbergs.de
Thu Nov 15 15:18:43 EST 2007


On Nov 15, 2007, at 20:15 , Dan Williams wrote:

> On Thu, 2007-11-15 at 09:54 -0500, C. Scott Ananian wrote:
>> Joyride-277 doesn't validate, because it contains a file from the
>> library with a filename in non-normalized unicode.  The file is named
>> 'Annobo?n_Bioko-thumb.jpg', where the ? should be a separate combining
>> accent on the o, but it is actually stored in the filename as a single
>> precomposed 'o+accent' character.
>>
>> Now, at first blush this is a bug in the (fast) contents verifier,
>> which I will fix: all strings should be unicode-normalized before they
>> are compared.  But it seems like this raises issues with (for example)
>> URLs to library content.  Should we enforce the constraint that all
>> filenames are unicode-normalized on disk, so that we can guarantee
>> that a (unicode-normalized) URL will always resolve correctly?
>
> Everything on disk should be UTF-8.  Anything that's not UTF-8 will not
> be guaranteed to work.  Filenames need to be converted to UTF-8 before
> the file is opened/created/renamed/etc.
>
> Dan

You're missing the point, which is that there are several equivalent
Unicode sequences that can be used to represent the same characters.
To compare them, you have to normalize them first. See

http://www.unicode.org/reports/tr15/

Using UTF-8 to encode the characters is not the problem; agreeing on
the kind of normalization is.
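
To make that concrete, here is a minimal Python sketch. The string "Annobón" is only my guess at the name behind the '?' in the quoted filename, and same_name() is a hypothetical helper, not anything in the contents verifier. Both spellings are perfectly valid UTF-8, yet they compare unequal until they are normalized to the same form:

```python
import unicodedata

# Two spellings of what looks like the same name on screen.
precomposed = "Annob\u00f3n"   # U+00F3 LATIN SMALL LETTER O WITH ACUTE
decomposed = "Annobo\u0301n"   # "o" followed by U+0301 COMBINING ACUTE ACCENT

# Both encode to valid UTF-8, but to different byte sequences.
precomposed_bytes = precomposed.encode("utf-8")   # b'Annob\xc3\xb3n'
decomposed_bytes = decomposed.encode("utf-8")     # b'Annobo\xcc\x81n'


def same_name(a, b, form="NFD"):
    """Compare two strings after normalizing both to the same form.

    Hypothetical helper for illustration; the form just has to be
    the same on both sides.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
```

A plain precomposed == decomposed is False here, while same_name(precomposed, decomposed) is True, whichever of NFC or NFD is picked.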

- Bert -

>
>> Otherwise we run the risk of someone editing a file and resaving it
>> with a name which *appears* identical, but is actually encoded
>> differently on disk, and having URLs to the file mysteriously break.
>>
>> For the technically-minded, we're talking about using the UTF-8
>> encoding of Unicode Normalization Form D, as discussed (briefly) at
>> http://wiki.laptop.org/go/Canonical_JSON.  The problem has arisen
>> because the old libraries used normalized filenames, but we've
>> switched to installing the libraries from RPMs, and apparently
>> non-normalized filenames have snuck in.  If I were to hazard a guess,
>> I'd say that the tar command normalizes filenames as they are
>> archived, while RPM does not.
>>
>> My proposal is to ensure that all filenames in the base system (at
>> least) are in normalization form D. I will write a checker in the
>> build process to ensure this, and we should probably eventually write
>> checkers for the activity/library bundle tools that will do the same.
>>  --scott
>>
>
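
The build-time checker Scott proposes could look roughly like the sketch below. This is only an illustration under my own assumptions: the names is_nfd() and find_non_nfd() are made up, it is not the checker that was actually written, and it assumes the names on disk decode as UTF-8.

```python
import os
import unicodedata


def is_nfd(name):
    """True if name is already in Unicode Normalization Form D."""
    return unicodedata.normalize("NFD", name) == name


def find_non_nfd(root):
    """Return the paths under root whose own name is not in NFD.

    Hypothetical helper: checks each file and directory name, not the
    whole path, so one bad directory is reported once.
    """
    offenders = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_nfd(name):
                offenders.append(os.path.join(dirpath, name))
    return sorted(offenders)
```

A build script could call find_non_nfd() on the image root and fail the build if the returned list is not empty.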





