[sugar] Develop i18n design (was Re: Develop activity (Oops...))

Fri Aug 17 16:04:12 EDT 2007

>
> But first, a note: we obviously agree on the basics: The .py file
> should be readable without a special editor, and usable as such by
> the interpreter; translation should be a process after load and
> before save in the special editor (possibly with some state memory in
> between.) What we're arguing about is getting more and more marginal
> or implementation-related, and I believe that's a good thing.

Agreed.

It seems that you're even favorable to my proposed compromise, so let's
focus our work hashing it out.

>
> > 3. Compromise: Do a good half job first, then optionally do the
> > whole job.
> >
> > When initially created, files have a "preferred" language and
> > things behave more-or-less as in 2. When somebody decides that the
> > "interface" for a file has a complete English translation, they can
> > convert that file into "English preferred", which puts language
> > tags on all identifiers not yet translated. This also creates an
> > "internal" section in that file's translation dictionary which is
> > not used by clients (until the user asks for it, and it's then
> > moved to the "public" section of the dictionary). An English-
> > preferred file can be edited in another language, and all new
> > identifiers are tagged with the UI language. Thus, English-
> > preferred files behave much as in option 1.
>
> I like your compromise a lot. I was myself coming to the notion of a
> "default language tag" that would allow to not see language tags
> strewn around in a normal editor.
> The only thing I would want to add to that spec is the possibility
> that someone with a language-aware editor set to German could edit a
> French-preferred file, even w/o English translations, by adding
> German language tags to new identifiers. So identifiers without tags
> are assumed to be French as it's the module's setting; new ones can
> still be tagged. What do you think? Or did you already have this in
> mind?

Hadn't thought of it, but it's not too hard.
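
To pin down the rule we seem to agree on, here is a minimal sketch, not actual Develop code. It assumes the lang___name tag form used elsewhere in this message; the function names and the known_langs set are made up for illustration.

```python
TAG_SEP = "___"  # assumed separator, as in en___X or m2lang___X


def split_tag(identifier, known_langs):
    """Return (lang, bare_name); lang is None when no known tag is present."""
    lang, sep, name = identifier.partition(TAG_SEP)
    if sep and name and lang in known_langs:
        return lang, name
    return None, identifier


def language_of(identifier, preferred, known_langs):
    """Untagged identifiers are assumed to be in the file's preferred language."""
    lang, _ = split_tag(identifier, known_langs)
    return lang if lang is not None else preferred


def on_disk_name(name, ui_lang, preferred):
    """Name written to disk for a NEW identifier typed in ui_lang."""
    if ui_lang == preferred:
        return name                      # default language: no tag needed
    return ui_lang + TAG_SEP + name      # anything else gets an explicit tag
```

So a German speaker editing a French-preferred file would get de___ tags on the identifiers they add, while everything untagged keeps reading as French.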

> >> Related question: Do you expect translation to carry between modules?
> >
> > Yes, I expect it to carry across one level of importation. Not
> > across any module in the universe, oh God no.
> >
> >> Eg would translating file.close as fichier.fermer have an impact on
> >> anything named 'close' in any module (such as StringIO)?
> >> I admit I did not think so initially: translations were module-
> >> specific in my mind. But I am now reconsidering; my initial position
> >> certainly makes duck typing un-intelligible. But applying
> >> translations broadly may be an issue in some cases, and introduce
> >> more ambiguity.
> >
> > Again, this is the reason I want to start out NOT translating
> > module internals - overapplying half-assed internal translations.
>
> I am afraid the problem I raised has nothing to do with internals.
> "close" is very much part of the public interface.
> Let me reiterate:
> Suppose two modules m1 and m2 define classes c1 and c2 respectively,
> both with a method X in their public interface.
> m1 is translated, and states that X is iks in the current target
> language. m2 is untranslated.
> We are trying to display a file that goes like this:
>
> import m1, m2
>
> def maFonction_i18n_fr(aParam):
>     aParam.X()
> EOF
>
> We do not, cannot know whether aParam refers to an m1.c1 or m2.c2
> instance; possibly maFonction could receive both instances.
> so we cannot know whether X is translated or not.
> I am not sure how to handle it; how do you?

Good job stating the case. Just one detail that I think you misstated: m1
states X is English, not the current target language; otherwise X would not
be showing up in the file on disk (as you state the problem, at worst it
would be ***m1lang___X***, which is NOT in the example I want).

Let's call the third module, the one doing the importing, m3.

As stated, our editor knows that X exists in m1, because it has an entry in
the public dict. In my vision, the editor does not go willy-nilly scanning
actual .py files; it just imports public dicts, so it may or may not know
that m2 exports an X, depending on whether that's in m2.t7n.

If it knows, things are fine. As soon as you put an identifier in the public
dict of m2, that identifier is tagged on disk for safety. So we have
m2lang___X, and no collision, though this can show up as (X, m2lang___X),
(en___X, X), or (en___X, m2lang___X) depending on m3lang. (And if m1 has a
good translation, that would go in the first element of those pairs; and if
m1 gives the null translation X=X, you could even end up seeing the
disambiguation (m1___X, m2___X). If you type X in that last case, it shows up
bright red as an error and the file refuses to save until you fix it.)
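
A rough sketch of the display rule behind those three pairs, under my own assumptions: display_name and shown_pair are hypothetical names, "fr" stands in for m2lang, and the m1___X / m2___X disambiguation case is left out.

```python
TAG_SEP = "___"  # assumed tag separator


def display_name(name, lang, viewer_lang, translations=None):
    """How one exported identifier is shown to a viewer.

    name         -- bare identifier as its author wrote it
    lang         -- language the identifier is written in
    viewer_lang  -- UI language of the editor (m3lang in the text)
    translations -- optional {language: translated name} from the public dict
    """
    translations = translations or {}
    if viewer_lang in translations:
        return translations[viewer_lang]   # a real translation wins
    if lang == viewer_lang:
        return name                        # viewer's own language: no tag shown
    return lang + TAG_SEP + name           # foreign: keep the tag visible


def shown_pair(viewer_lang, m2lang="fr"):
    """(m1's X, m2's X) as displayed; m1 is English, m2 is written in m2lang."""
    return (display_name("X", "en", viewer_lang),
            display_name("X", m2lang, viewer_lang))
```

With m3lang set to English, to m2lang, or to some third language, shown_pair gives the three combinations listed above. If m2lang were itself English, both names would display as plain X, which is exactly the collision this sketch does not handle.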

If it does NOT know, then we have a problem, and our goal should be to let
an aware user notice it and fix it as soon as possible. Remember, for now
this file works as intended, but if someone comes along and gives m2 a new
English translation for X, everything will break.

If m3lang == m2lang, the original m3 programmer should have seen either
en___X or m1translation. If they type X, this will automatically be changed
to m3lang___X on disk, so their code never ran; they catch the error,
right-click on X, say 'add translation to file m2', and leave the English
blank because they don't speak English. Everything then works fine because
both files use m3lang___X on disk.

If m3lang == en, then we have a tougher problem waking the programmer up.
They would have to be moderately aware to notice that when they looked into
m2 to find that the method was called X, everything was colored non-English
and was in some funny language. So they should by instinct want to write
m2lang___X if they mean the m2 one. And then when that doesn't work, they
can easily fix the problem as above.

Say they don't wake up. Then somebody else comes along and adds a
translation to m2, saying that X is exported (either giving an English
translation Y or leaving it as m2lang___X; let's call those both Y). The
next time our English-speaking m3 programmer opens their file, our logic
tries to add the new translation, notices the conflict, and complains. It
turns all the Xes into WARNING___X or something, then tells the programmer
to search through for WARNING___X and change it either into X or Y,
depending on whether it's from m1 or m2. (If they are the one to add the new
translation, all the better; if m3 is open when the translation is added,
they get the warning right away.)
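
Here is roughly what that explicit step could look like. This is only a sketch: the shape of the public dicts (module name mapped to a set of exported names) is an assumption, and a plain regex will also touch strings and comments, so a real version would want to work on tokens.

```python
import re


def ambiguous_exports(public_dicts):
    """Names exported by more than one imported module.

    public_dicts -- {module name: set of exported bare names} (assumed shape)
    """
    seen = {}
    for module, names in public_dicts.items():
        for name in names:
            seen.setdefault(name, set()).add(module)
    return {name for name, modules in seen.items() if len(modules) > 1}


def mark_conflicts(source, names, marker="WARNING"):
    """Rewrite each bare use of an ambiguous name as marker___name.

    Already-tagged names such as m2lang___X are left alone, because the
    underscore before X is a word character and so \\b does not match there.
    """
    if not names:
        return source
    alternatives = "|".join(re.escape(n) for n in sorted(names))
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda m: marker + "___" + m.group(1), source)
```

The programmer then searches for WARNING___ and resolves each hit by hand, as described above.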

That last paragraph is the only explicit coding we have to do for this case.
The rest of it all happens naturally, as a consequence of how things work
anyway.

> >> a) I believe there should be one translation file per language.
> >> More file pollution, less parsing....
> >
> > Even if we don't have a sci-fi live, wiki-style central translation
> > database, I think we can at least count on some sort of managed git
> > type deal. It would not be very hard to make a tool to merge in or
> > delete out languages from a given translation file, as long as
> > there was at least one canonical language for each file. So on my
> > computer, I only have the languages for my country, including
> > immigrant communities weighted by some compromise between the total
> > language size and its in-country size, plus English.
>
> I agree we should be able to select which languages we are storing.
> But we are exchanging information, so the information on all
> languages probably exists somewhere (even if only in distributed
> form, which is the most difficult case.)
> If the default file format mixes language information, it creates a
> burden of internal file management where there could be none. One
> file per language sidesteps a lot of manipulation.

OK.  Here are the issues:
file pollution
parsing CPU burden
memory burden
file manipulation burden
        related: human-readable and editable (with text editor, or just
        with spreadsheet?)
disk space

... I think I could come up with a plan to optimize for any 3 of those
attributes, maybe 4, but not for all 6. So what are the priorities? On OLPC,
I'd say memory, human-readable, and disk space. Which, darn it, gives you
the win here over all my 'clever' (i.e. homemade and nonstandard) data format
designs. But one problem is that that does mean a real possibility that file
versions will desynchronize, and that looks like a problem to me... do you
have a plan?

And apparently you're seeing a totally different issue, because I just can't
figure out the problem that the following is intended to remedy, or what
you're talking about at all. For me, a giant matrix is fundamental, and
we're just arguing about how to slice it up on disk; it appears you have a
different idea. Remember, as in the above discussion, that even a row in
that matrix with only one untranslated word in it can work to announce that
that word is publicly exported by a given file... so how do we keep the
rows in sync between subfiles? Also, this would mean that you need at least
English AND the preferred language in each subfile.
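
Whatever the on-disk format ends up being, the sync worry can at least be checked mechanically. A rough sketch, assuming each per-language subfile has already been parsed into a dict (missing_rows is a made-up name):

```python
def missing_rows(per_language):
    """Find matrix rows that some per-language subfiles lack.

    per_language -- {language: {identifier: translation}} (assumed shape)
    Returns {language: sorted identifiers present elsewhere but missing here};
    languages with nothing missing are omitted.
    """
    all_ids = set().union(*per_language.values())
    report = {}
    for lang, table in per_language.items():
        missing = all_ids - set(table)
        if missing:
            report[lang] = sorted(missing)
    return report
```

An empty result means every subfile announces the same set of exported identifiers; anything else is a row that has drifted out of sync.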

OK, here's where I'm at:
> Each translation is indexed by the hash of the identifier (we can
> even store it as binary.)
> Collisions may happen; in that case use the identifier in full.
> (Signal with a hash=0 index.) Should be rare enough to have a
> negligible impact on size.
> The difficult case is when an identifier is introduced (in a normal
> editor) that collides with a previously translated identifier.
> The solution is to maintain one file that contains all non-colliding
> identifiers with the hash they had before collision.
> (Since translations are only created by the language-aware editor, it
> knows to maintain that file.)
> New identifiers that collide with those will be detected by the new
> editor, and will allow to update translation files correctly.
> Note that read access to that file should be rare; only when a
> collision is actually detected in the module. (However, saving the
> module involves maintaining the hash file.)
> Ugh. Not that pretty either. But easily isolatable as a subsystem;
> and I am convinced that it beats merging data from disparate sources
> into a large matrix/paring down the matrix...
> (I'd be willing to write that if you want.)
>
>
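
For reference, a minimal sketch of the hash-indexed lookup proposed above, so we are arguing about the same thing. The choice of CRC32 and all the names here are assumptions. The side file of pre-collision hashes is not modeled, so an untranslated identifier whose hash matches a translated one would still get a false hit (the "difficult case").

```python
import zlib


def ident_hash(identifier):
    """32-bit hash of an identifier; 0 is reserved to mean 'stored in full'."""
    h = zlib.crc32(identifier.encode("utf-8"))
    return h or 1


def build_index(translations, hash_fn=ident_hash):
    """Split translations into a hash-keyed table and a full-name table.

    translations -- {identifier: translated name}
    Returns (by_hash, by_name); by_name holds the colliding identifiers,
    i.e. the entries that would be signalled with a hash=0 index on disk.
    """
    buckets = {}
    for ident in translations:
        buckets.setdefault(hash_fn(ident), []).append(ident)
    by_hash, by_name = {}, {}
    for h, idents in buckets.items():
        if len(idents) == 1:
            by_hash[h] = translations[idents[0]]
        else:
            for ident in idents:
                by_name[ident] = translations[ident]
    return by_hash, by_name


def lookup(identifier, by_hash, by_name, hash_fn=ident_hash):
    """Translated name, or None if the identifier has no entry."""
    if identifier in by_name:
        return by_name[identifier]
    return by_hash.get(hash_fn(identifier))
```

Passing a deliberately weak hash_fn (len, say) is an easy way to exercise the collision path.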