[sugar] Develop i18n design (was Re: Develop activity (Oops...))

Thu Aug 16 17:07:52 EDT 2007

On 07-08-15 at 01:15, Jameson "Chema" Quinn wrote:

> This is a fun conversation, but I'd like to hear other voices.

So would I. Hope you don't mind that I'll still have questions and  
comments.
But first, a note: we obviously agree on the basics: The .py file  
should be readable without a special editor, and usable as such by  
the interpreter; translation should be a process after load and  
before save in the special editor (possibly with some state memory in  
between.) What we're arguing about is getting more and more marginal  
or implementation-related, and I believe that's a good thing.

> Let me try to rephrase the core positions and present a possible  
> compromise, to make it easier for others to weigh in on which they  
> support. Please correct me where I've misread you.
>
> 1. MAP position: Better to do the whole job
>
> If we're going to be translating, we should translate everything.  
> On-disk identifiers should ALL be either tagged with a language, or  
> in English. Being universal helps keep things clean.

So far so good, with a few minor caveats. Untagged foreign  
identifiers are possible, but the editor should make it convenient to  
replace them with an English equivalent whenever possible.
Private identifier translation is, I believe, a useful goal; as such  
it should be made possible by the toolset. But it is obviously very  
labour-intensive and should not be required.

> When importing a file, you import its dictionary, and the  
> dictionaries of all the files it imports (with intelligent  
> resolution of conflicts).

Hmmm...  I did not say that, but you seem to imply it is a  
consequence of what I'm proposing, and as you spent more time than I  
with implementation you may be right.
What is getting clearer and clearer to me is that there will be  
complicated cases, mostly involving methods defined in superclasses  
of imported classes and duck typing.
I do not know how to solve those; nor do I understand fully how you  
intend to.

> When creating a new English translation for an untranslated  
> identifier, the editor ??automatically applies it in all files with  
> that identifier??.

No, I did not say that. It applies it to the module where the  
identifier is defined, and as the magic editor opens files that  
import that module, they should notice the translation (in the  
translation history block) and apply it then.

> 1a. The translation logic knows enough python grammar to display  
> fake import statements.
> 1b. A non-English user will find editing their own files without  
> the "intelligent" tools pretty annoying, as nearly everything will  
> have language tags hanging off of it.

I am beginning to suspect that this is what is annoying you most  
about my proposal. And it is also a valid point. And it makes your  
compromise very appealing. (cont'd below)

> 3. Compromise: Do a good half job first, then optionally do the  
> whole job.
>
> When initially created, files have a "preferred" language and  
> things behave more-or-less as in 2. When somebody decides that the  
> "interface" for a file has a complete English translation, they can  
> convert that file into "English preferred", which puts language  
> tags on all identifiers not yet translated. This also creates an  
> "internal" section in that file's translation dictionary which is  
> not used by clients (until the user asks for it, and it's then  
> moved to the "public" section of the dictionary). An English- 
> preferred file can be edited in another language, and all new  
> identifiers are tagged with the UI language. Thus, English- 
> preferred files behave much as in option 1.

I like your compromise a lot. I was myself coming to the notion of a  
"default language tag" that would avoid having language tags strewn  
around in a normal editor.
The only thing I would want to add to that spec is the possibility  
that someone with a language-aware editor set to German could edit a  
French-preferred file, even w/o English translations, by adding  
German language tags to new identifiers. So identifiers without tags  
are assumed to be French as it's the module's setting; new ones can  
still be tagged. What do you think? Or did you already have this in  
mind?
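
To make that rule concrete, here is a minimal sketch, not a spec. It assumes the tag is a trailing "_i18n_<lang>" suffix, as in the maFonction_i18n_fr example further down; the function names and the two-or-three-letter language code are my own assumptions.

```python
import re

# Assumed tag convention: a trailing "_i18n_<lang>" marks the identifier's language.
TAG_RE = re.compile(r"^(?P<name>.+)_i18n_(?P<lang>[a-z]{2,3})$")


def identifier_language(identifier, module_default):
    """Return (bare_name, language) for an identifier.

    Untagged identifiers are assumed to be in the module's preferred language.
    """
    match = TAG_RE.match(identifier)
    if match:
        return match.group("name"), match.group("lang")
    return identifier, module_default


def tag_new_identifier(name, editor_language, module_default):
    """Tag a new identifier only when the editor's language differs
    from the module's preferred language."""
    if editor_language == module_default:
        return name
    return name + "_i18n_" + editor_language
```

So a German editor working on a French-preferred module would write meineFunktion_i18n_de, while existing untagged names keep reading as French.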

>> Related question: Do you expect translation to carry between modules?
>
> Yes, I expect it to carry across one level of importation. Not  
> across any module in the universe, oh God no.
>
>> Eg would translating file.close as fichier.fermer have an impact on
>> anything named 'close' in any module (such as StringIO)?
>> I admit I did not think so initially: translations were module-
>> specific in my mind. But I am now reconsidering; my initial position
>> certainly makes duck typing un-intelligible. But applying
>> translations broadly may be an issue in some cases, and introduce
>> more ambiguity.
>
> Again, this is the reason I want to start out NOT translating  
> module internals - overapplying half-assed internal translations.

I am afraid the problem I raised has nothing to do with internals.  
"close" is very much part of the public interface.
Let me reiterate:
Suppose two modules m1 and m2 define classes c1 and c2 respectively,  
both with a method X in their public interface.
m1 is translated, and states that X is iks in the current target  
language. m2 is untranslated.
We are trying to display a file that goes like this:

import m1, m2

def maFonction_i18n_fr(aParam):
    aParam.X()
EOF

We do not, cannot know whether aParam refers to an m1.c1 or m2.c2  
instance; possibly maFonction could receive both instances.
So we cannot know whether X is translated or not.
I am not sure how to handle it; how do you?
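
Here is a rough sketch of how an editor could at least detect the ambiguity, even if it cannot resolve it. The two dictionaries are hypothetical stand-ins for whatever the translation blocks and module introspection would really provide; only m1, m2, X and iks come from the example above.

```python
# Hypothetical per-module translation dictionaries: module -> {method: translation}.
# m1 is translated; m2 has no entries at all.
TRANSLATIONS = {
    "m1": {"X": "iks"},
    "m2": {},
}

# Hypothetical record of which methods each module exposes publicly.
DEFINES = {
    "m1": {"X"},
    "m2": {"X"},
}


def display_candidates(method, imported_modules):
    """Collect every name the editor could plausibly display for `method`.

    A module that defines the method but has no translation for it
    contributes the untranslated name itself.
    """
    candidates = set()
    for module in imported_modules:
        if method in DEFINES.get(module, set()):
            candidates.add(TRANSLATIONS.get(module, {}).get(method, method))
    return candidates


def is_ambiguous(method, imported_modules):
    """True when the imports disagree on how `method` should be displayed."""
    return len(display_candidates(method, imported_modules)) > 1
```

With both modules imported, the candidates for X are "iks" and "X", so the call in maFonction is flagged as ambiguous. That tells us when the problem occurs, not what to display.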

> >> a) I believe there should be one translation file per language.
> >> More file pollution, less parsing....
>
> Even if we don't have a scifi live, wiki-style central translation  
> database, I think we can at least count on some sort of managed git  
> type deal. It would not be very hard to make a tool to merge in or  
> delete out languages from a given translation file, as long as  
> there was at least one canonical language for each file. So on my  
> computer, I only have the languages for my country, including  
> immigrant communities weighted by some compromise between the total  
> language size and its in-country size, plus English.

I agree we should be able to select which languages we are storing.  
But we are exchanging information, so the information on all  
languages probably exists somewhere (even if only in distributed  
form, which is the most difficult case.)
If the default file format mixes language information, it creates a  
burden of internal file management where there could be none. One  
file per language sidesteps a lot of manipulation.

> And even a pretty fragmented country, like Guatemala or Mexico,  
> doesn't have more than a few dozen languages, of which at most a  
> handful are over your cutoff of 100K. (Here, there are officially  
> 25 languages, of which I think about 5 beat 100K). I think that  
> translation burden and quality concerns become pretty unmanageable  
> below some number in 50-100K. So: say I have 12 languages in my  
> versions of the files (generous, I think) Say that each identifier  
> has an average of 10 versions. With your factor of 50*16*5 that  
> gives 40K of (easy) parsing (and memory use)... a noticeable, but  
> not an unmanageable, number. Still, you're right, indiscriminate  
> importation (which is a phase some new programmers go through)  
> could bring the OLPC to its knees as you opened Develop.
>
> But I'm actually saving on "disk" (NAND or whatever you call it)  
> space versus your proposal, because I'm giving the same languages  
> (I assume) without repeating the English in every file.

But if you use gettext, you save active-memory space due to some  
compression in the .mo file. Now, the .po files take up NAND space,  
granted. Hmmm... This is not a resource I am used to saving.
I am trying to think of a scheme to save space without adding too  
much complexity (I classify putting all text in a matrix as too  
complex.)

OK, here's where I'm at:
Each translation is indexed by the hash of the identifier (we can  
even store it as binary.)
Collisions may happen; in that case use the identifier in full.  
(Signal with a hash=0 index.) Should be rare enough to have a  
negligible impact on size.
The difficult case is when an identifier is introduced (in a normal  
editor) that collides with a previously translated identifier.
The solution is to maintain one file that contains all non-colliding  
identifiers with the hash they had before collision.
(Since translations are only created by the language-aware editor, it  
knows to maintain that file.)
New identifiers that collide with those will be detected by the new  
editor, which can then update the translation files correctly.
Note that read access to that file should be rare; only when a  
collision is actually detected in the module. (However, saving the  
module involves maintaining the hash file.)
Ugh. Not that pretty either. But easily isolatable as a subsystem;  
and I am convinced that it beats merging data from disparate sources  
into a large matrix/paring down the matrix...
(I'd be willing to write that if you want.)
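
To give an idea of the size of that subsystem, here is a minimal in-memory sketch. Everything in it is an assumption on my part: CRC32 as the hash, the class and attribute names, and dictionaries standing in for the on-disk translation file and the side hash file. For simplicity it consults the owner map on every lookup, whereas the scheme above would only read the hash file when a collision is detected.

```python
import zlib


def ident_hash(identifier):
    """32-bit hash of an identifier; 0 is reserved as the collision marker."""
    h = zlib.crc32(identifier.encode("utf-8"))
    return h or 1  # never return the reserved value 0


class TranslationTable:
    """Translations for one language, indexed by identifier hash."""

    def __init__(self, hash_fn=ident_hash):
        self.hash_fn = hash_fn
        self.by_hash = {}  # hash -> translation (the compact, common case)
        self.by_name = {}  # full identifier -> translation (the "hash=0" entries)
        self.owner = {}    # hash -> first identifier seen; stands in for the hash file

    def add(self, identifier, translation):
        h = self.hash_fn(identifier)
        known = self.owner.get(h)
        if known is None:
            # First identifier with this hash: store compactly.
            self.owner[h] = identifier
            self.by_hash[h] = translation
        elif known == identifier and h in self.by_hash:
            # Same identifier, no collision so far: update in place.
            self.by_hash[h] = translation
        else:
            # Collision: every colliding identifier is stored in full.
            if h in self.by_hash:
                self.by_name[known] = self.by_hash.pop(h)
            self.by_name[identifier] = translation

    def lookup(self, identifier):
        if identifier in self.by_name:
            return self.by_name[identifier]
        h = self.hash_fn(identifier)
        if self.owner.get(h) == identifier:
            return self.by_hash.get(h)
        return None  # untranslated, even if its hash matches another identifier
```

Passing a deliberately bad hash such as len as hash_fn makes "close" and "write" collide, which is a cheap way to exercise the collision path.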

>> so-called "arabic" numerals (westernized indic!)
>  Indic? That's a new one on me!

Well, playing fast and loose here. Western numerals are adapted from  
Arabic ones, which were in turn adapted from Indic ones ("Indic" being  
a generic name for early Indian languages). As far as we know, the  
modern positional system we use  
was invented there (as the story goes, to support Jain theological  
speculation about the infinity of worlds) and soon afterwards in  
China, very probably independently. I hate calling Western numerals  
"arabic", first because there is such a thing as modern arabic  
numerals, and I like to be able to refer to them as such; and second  
because the Arabs did not invent those numerals, merely passed them  
on to us from India. But honestly they are of Arabic origin, insofar as  
we (heirs of western Europe) did not get them directly from India...  
So my attempted witticism replaced one inaccuracy by another. The  
long story is nowhere near as convincing as the off-the-cuff remark,  
as is often the case ;-) Hope it's more interesting, anyway!

Cheers,
Marc-Antoine