Unicode in filenames

Albert Cahalan acahalan at gmail.com
Fri Nov 16 00:37:22 EST 2007


On Nov 15, 2007 11:51 PM, C. Scott Ananian <cscott at laptop.org> wrote:
> On Nov 15, 2007 2:43 PM, Albert Cahalan <acahalan at gmail.com> wrote:
> > C. Scott Ananian writes:
> > The accepted standard is to use precomposed glyphs. This is
> > compatible with the Linux kernel, with Windows, and with many
> > other things that you aren't about to change.
>
> The linux kernel doesn't care one way or another.  Filenames are just
> byte strings.

The kernel sure does care. Try "ls" on the Linux console with a
filename containing C with cedilla; I think the default console
font has that glyph.

Also, filenames are NOT just byte strings once the
kernel needs to deal with non-native filesystems.
This includes vfat, cifs, and iso9660 with Joliet.

> Windows doesn't play well with unicode, period -- case insensitivity
> causes a real mess -- and is pretty irrelevant anyway.

If it's pretty irrelevant, then Sugar should automatically
convert USB storage devices to ext3. :-)

> There is no 'accepted standard'.  The w3c recommends the use of NFC
> for information exchange; that's as close to a specific recommendation
> as I could find.  NFC is "form NFD, then do canonical composition", so
> there's an efficiency argument to be made in favor of using NFD.

That sounds like an accepted standard to me. It's not just
the W3C though. You'll get something like form C or KC
on a USB storage device that comes from a Windows box.
Based on the W3C agreeing with Microsoft, I guess you'll
also see it in URLs.
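
To make the two forms concrete, here is a minimal sketch using Python's
standard unicodedata module (my choice of tool for illustration, nothing
Sugar-specific). It takes C with cedilla in precomposed and decomposed
form and shows that NFC and NFD map each onto the other, and that the
two forms are different byte strings once encoded as UTF-8.

```python
import unicodedata

# Precomposed: a single code point, LATIN CAPITAL LETTER C WITH CEDILLA.
precomposed = "\u00c7"
# Decomposed: plain "C" followed by COMBINING CEDILLA.
decomposed = "C\u0327"

# NFC composes, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print(nfc == precomposed)            # True
print(nfd == decomposed)             # True

# The same visible character gives two different UTF-8 byte strings,
# which is what a filesystem that stores raw bytes actually sees.
print(precomposed.encode("utf-8"))   # b'\xc3\x87'
print(decomposed.encode("utf-8"))    # b'C\xcc\xa7'
```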

> > Normalization form C or KC would be far better, but I still don't
> > think this is something that should be enforced.
>
> As I described in my original mail, not enforcing a standard -- at
> least for core files and for things written by sugar and activities --
> will lead to madness: "identical" filenames/urls which aren't found
> when expected.

If you're having trouble keeping filenames in one normalization
form, you're likely to have the same trouble with the other. The
kernel is not about to enforce this; it doesn't even enforce UTF-8.
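
Since nothing below userspace will do the normalizing, code that wants
"identical" names to match has to compare them itself. A rough sketch of
that, again in Python; find_name and the sample listing are hypothetical,
made up for illustration, and not part of Sugar or any existing API:

```python
import unicodedata

def nfc(name):
    """Return name in normalization form C."""
    return unicodedata.normalize("NFC", name)

def find_name(listing, wanted):
    """Return the entry in listing that equals wanted after NFC
    normalization of both sides, or None if there is no such entry.
    (Hypothetical helper, for illustration only.)"""
    target = nfc(wanted)
    for entry in listing:
        if nfc(entry) == target:
            return entry
    return None

# Stored with a precomposed c-cedilla...
listing = ["Fran\u00e7ais.txt"]
# ...but requested with "c" + COMBINING CEDILLA.
wanted = "Franc\u0327ais.txt"

print(wanted in listing)                        # False: exact match fails
print(find_name(listing, wanted) is not None)   # True: match after NFC
```

Normalizing both sides at comparison time leaves the stored names alone,
so it works no matter which form the files arrived in.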

> > Files in the base system should only use characters that can be found
> > on **all** keyboards and in **all** fonts. Probably this means ASCII.
>
> I believe SJ would beg to differ: the only files we currently have
> which are non-ASCII are in the library.

Can those be typed on...

the West African keyboard?
the Arabic keyboard?
the Cyrillic keyboard?
the Devanagari keyboard?
...

If not, then they need to go.

Can those be viewed with...

the Terminal activity?
the Browse activity?
the Linux console?
the Journal activity?
the eToys activity?
...

If not, then they need to go.


