[sugar] Develop i18n design (was Re: Develop activity (Oops...))

Wed Aug 15 01:15:57 EDT 2007

This is a fun conversation, but I'd like to hear other voices. Let me try to
rephrase the core positions and present a possible compromise, to make it
easier for others to weigh in on which they support. Please correct me where
I've misread you.

1. MAP position: Better to do the whole job

If we're going to be translating, we should translate everything. On-disk
identifiers should ALL be either tagged with a language, or in English.
Being universal helps keep things clean. When importing a file, you import
its dictionary, and the dictionaries of all the files it imports (with
intelligent resolution of conflicts). When creating a new English
translation for an untranslated identifier, the editor ??automatically
applies it in all files with that identifier??.

1a. The translation logic knows enough Python grammar to display fake import
statements.
1b. A non-English user will find editing their own files without the
"intelligent" tools pretty annoying, as nearly everything will have language
tags hanging off of it.

2. JAQ position: Better to do a good half job

Translating everything is an impossible task, so we should encourage
focusing translation efforts where they are most natural, produce the most
benefit, and get the most attention (thus quality): on the interfaces
between modules. On-disk identifiers should be either (assumed)
in a file's preferred language, in English, or tagged as untranslated
imports from a different-language file. When importing a file, you import
its dictionary but not those of all the files it imports (except, as a
special case, method names and method-call variables of its explicit
superclasses). When creating a new English translation for an untranslated
identifier, the editor (tries to figure out what imported files to apply it
to, but) requires user guidance to apply that translation to 1 file only.

2a. The only Python that the translation logic knows is how to read "import"
statements. Everything else is display-only and it never changes the syntax,
just the lexes.
2b. Editing with a dumb editor is possible (assuming alphabet issues can be
resolved). This is somewhat more dangerous in terms of introducing rare
bugs, if there are incompletely-translated files imported in different
languages, but it is a viable option.
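
To make 2a concrete, here is a minimal sketch of a display-only pass that swaps identifier tokens and leaves syntax, strings and comments alone. It uses the standard tokenize module; the function name translate_for_display and the flat mapping dict are hypothetical, not part of any Develop code, and import handling is left out entirely.

```python
import io
import keyword
import tokenize


def translate_for_display(source, mapping):
    """Return `source` with identifier tokens renamed according to `mapping`.

    Only NAME tokens that are not keywords are touched, so string
    literals, comments and the syntax itself stay exactly as written.
    """
    lines = source.splitlines(keepends=True)
    edits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if (tok.type == tokenize.NAME
                and not keyword.iskeyword(tok.string)
                and tok.string in mapping):
            row, col = tok.start
            edits.append((row - 1, col, tok.end[1], mapping[tok.string]))
    # Apply edits right to left so earlier column offsets stay valid.
    for row, start, end, new in sorted(edits, reverse=True):
        line = lines[row]
        lines[row] = line[:start] + new + line[end:]
    return "".join(lines)


example = "f = open('close.txt')\nf.close()  # close it\n"
shown = translate_for_display(example, {"close": "fermer", "f": "fichier"})
```

Here only the identifiers f and close change; the word "close" inside the string and the comment is left untouched.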

3. Compromise: Do a good half job first, then optionally do the whole job.

When initially created, files have a "preferred" language and things behave
more-or-less as in 2. When somebody decides that the "interface" for a file
has a complete English translation, they can convert that file into "English
preferred", which puts language tags on all identifiers not yet translated.
This also creates an "internal" section in that file's translation
dictionary which is not used by clients (until the user asks for it, and
it's then moved to the "public" section of the dictionary). An
English-preferred file can be edited in another language, and all new
identifiers are tagged with the UI language. Thus, English-preferred files
behave much as in option 1.
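
A rough sketch of that conversion step, under loud assumptions: the tag format is borrowed from the es___untranslated___red example discussed later in this thread, the dictionary sections are plain dicts, and the function names are made up for illustration.

```python
def to_english_preferred(identifiers, lang, public):
    """Convert a file's identifiers to their "English preferred" on-disk form.

    identifiers: names used in the file, in its preferred language
    lang:        code of the preferred language, e.g. "es"
    public:      interface translations, native name -> English name
    Returns (on_disk, internal): the on-disk spelling for each name, and
    the new "internal" dictionary section (not visible to clients).
    """
    on_disk = {}
    internal = {}
    for name in identifiers:
        if name in public:
            on_disk[name] = public[name]
        else:
            # Hypothetical tag format for not-yet-translated identifiers.
            on_disk[name] = f"{lang}___untranslated___{name}"
            internal[name] = None  # no English translation yet
    return on_disk, internal


def promote(name, english, internal, public):
    """Move one entry from the internal section to the public one."""
    del internal[name]
    public[name] = english


public = {"abrir": "open"}
on_disk, internal = to_english_preferred(["abrir", "contador"], "es", public)
```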

3a. This is a choice between 1a and 2a; I don't see a compromise.
3b. Option 3 leads to a natural compromise between 1b and 2b.

Those are simplified positions; there are a lot of tricky details unsaid on
either side, but I hope I've at least given them good taglines.

As for the details of your message:

On 8/14/07, Marc-Antoine Parent <maparent at gmail.com> wrote:
>
>
> > 2. It also means that the dictionary for a given module could get
> > filled up with translations of each function's internal variables, etc.
>
> I see that as a desired feature. Code is always its own best
> documentation; reading code from experienced coders is the best
> learning strategy. So being able to display globally translated code
> feels necessary to me.

See above. I think this is our main philosophical disagreement; I'd rather
not pollute the translation map with imported unreliable internal
translations, sooner or later synonyms will bite you.

>
>
> One thing I wonder about, from some of your remarks: If module X
> imports module Y, do you expect translations of Y identifiers (and
> the name 'Y' itself) to be found in the X translation file? I reread
> you and it seems not, but I'd like to make absolutely sure.

No. DRY, SPOT, and all that. Doing this wrong was the hairiest of the beasts
in my first, abortive, attempt to do this.

> Related question: Do you expect translation to carry between modules?

Yes, I expect it to carry across one level of importation. Not across any
module in the universe, oh God no.
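
As a sketch of what "one level" means against the everything-reachable alternative (the import graph as a plain dict and the function name are assumptions for illustration; the superclass special case from position 2 is not modelled):

```python
def visible_dictionaries(module, imports, transitive=False):
    """Return the modules whose translation dictionaries apply to `module`.

    imports:    mapping of module name -> list of directly imported modules
    transitive: False gives one level of importation (position 2),
                True follows imports of imports as well (position 1).
    """
    seen = set()
    frontier = list(imports.get(module, []))
    while frontier:
        dep = frontier.pop()
        if dep in seen:
            continue
        seen.add(dep)
        if transitive:
            frontier.extend(imports.get(dep, []))
    return seen


imports = {"app": ["net"], "net": ["sockets"], "sockets": []}
one_level = visible_dictionaries("app", imports)
everything = visible_dictionaries("app", imports, transitive=True)
```

With this graph, one_level contains only net, while everything also pulls in sockets.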

> E.g., would translating file.close as fichier.fermer have an impact on
> anything named 'close' in any module (such as StringIO)?
> I admit I did not think so initially: translations were module-
> specific in my mind. But I am now reconsidering; my initial position
> certainly makes duck typing un-intelligible. But applying
> translations broadly may be an issue in some cases, and introduce
> more ambiguity.

Again, this is the reason I want to start out NOT translating module
internals: to avoid overapplying half-assed internal translations.

> I will think more about this, and I am curious about
> your thoughts. (I am mostly curious to see if this is how you saw
> things.)
>
> > If you have a good solution for the problems I mention above, I'd
> > be happy to consider it. As it is, I'm not saying it's impossible,
> > but my feeling as I tried to code it initially was that if you try
> > to make things too "magic", eventually you're going to make the
> > wrong assumption and you're going to create some bugs in the code
> > being edited that are really really hard to track down.
>
> I do not believe I introduce so much magic; so I am wondering whether
> there is a misunderstanding in some key assumptions that make what
> you attempted very different from what I am trying to explain.
> But then, I must admit I did not try coding either!

I think I understand you better, and no, I don't think your idea would be so
impossible to implement. It would lead to very crowded translation dicts
though, and that is IMO a bad thing (synonyms and just poor-quality
translations would both bite you).

> > And what is the benefit? Modules which are a "language soup",
> > maintained by an international coalition of children who can't even
> > talk to one another.
> >
> > I support the idea of international, cross-language programming
> > collaboration. However, I think that a basic assumption would be
> > that there were at least one common language of communication
> > between the participants, and that any given module has an owner
> > and thus a preferred language for all its still-untranslated
> > identifiers. If somebody wanted to "add to" that module, they'd
> > have to either write in that preferred language or use an explicit
> > import and subclass (or, of course, just do it in their language
> > for their own private use, because that file would then be
> > explicitly marked as "messy and suggested not for sharing")...
>
> You are quite right, this is the crux of the argument: are there real
> use cases for it?
>
> First, I agree with you that children who can read nothing about a
> module cannot use it; for example, we assume that the tool is useful
> to non-english speakers insofar as a translation is provided in a
> language they understand.
> So we can assume that if people are collaborating on a module, there
> are common languages. Not necessarily one, however; If I write a
> module in French, a Belgian can provide a Flemish translation, that a
> South African should be able to decipher and edit in Zulu. This does
> not require me to understand Zulu; and more important it does not
> require the South African to read French. I think that this would be
> a realistic benefit.
> It does, however, require the Belgian kid to keep busy translating if
> I am to re-use the code from the Zulu-speaking child! How likely is
> that, realistically? You may be right, and that process may be one-
> way only. Yet again... In my view, it depends on the difficulty or
> ease of making and distributing translations (which is why I was
> thinking of distributed databases...)
> Also, it depends on language community size... Suppose the module is
> written in English, and two people maintain a Spanish and a Russian
> translation independently... it is quite believable for improvements
> to be made in either language that could not be read by the other
> community; yet English as the hub makes it likely that they will see
> one another's improvements.
>
> Let me summarize: my use case for doing this could be stronger; but I
> would argue it does exist, and for my part I am not yet convinced
> that identifier-specific language tags present such difficulties as
> you suggest. Again, I would like to understand better what problems
> you ran into (other than those above, which I think I mostly
> answered, save point 4.)

Nothing to add. I still feel my way.

> >> a) I believe there should be one translation file per language.
> >> More file pollution, less parsing.
> >
> > My current mockup is going to use csv for its dictionary file
> > format - English in the first column and one language per additional
> > column, in order of addition. Internally it is just a 2d array - a
> > list of lists. A "row" is only as long as it has to be and can have
> > any number of "empty" values as long as at least one value is
> > defined. The parsing on this file is, needless to say, trivial. It
> > would not be much harder if you converted it to XML, with one word
> > in one language per line, and you'd get more-fine-grained diffs for
> > free.
>
> Hmmm... Here, allow me to insist. You are mentioning Garifuna, so I
> assume you are interested (as I am) in languages with a comparatively
> small population base. According to Ethnologue, there are > 6900
> living languages, of which >1200 are spoken by populations of 100K or
> more. Assume only half of that is represented; assume identifiers are
> 16 bytes on average (not generous at all: multibyte in utf-8 often
> costs double, when not triple!); assume (round number) 50 public
> identifiers per module; assume 5 modules are imported, we're well
> over 2 Mb worth of parsing... I honestly think this is worth avoiding.

Even if we don't have a sci-fi live, wiki-style central translation
database, I think we can at least count on some sort of managed git type
deal. It would not be very hard to make a tool to merge in or delete out
languages from a given translation file, as long as there was at least one
canonical language for each file. So on my computer, I only have the
languages for my country, including immigrant communities weighted by some
compromise between the total language size and its in-country size, plus
English.
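
A minimal sketch of such a pruning tool over the CSV dictionary format quoted above. The header row of language codes is my assumption for the sketch (the mockup description only says English is the first column), and the function names are hypothetical.

```python
import csv
import io


def load_dictionary(text):
    """Parse a CSV dictionary into (header, entries), a plain list of lists."""
    rows = list(csv.reader(io.StringIO(text)))
    return rows[0], rows[1:]


def keep_languages(header, entries, wanted):
    """Drop every language column not in `wanted`.

    Column 0 (assumed to be the canonical language) is always kept.
    Rows may be ragged; missing trailing cells count as empty.
    """
    cols = [0] + [i for i, code in enumerate(header) if i > 0 and code in wanted]
    pruned = []
    for row in entries:
        new_row = [row[i] if i < len(row) else "" for i in cols]
        while len(new_row) > 1 and new_row[-1] == "":
            new_row.pop()  # a row is only as long as it has to be
        pruned.append(new_row)
    return [header[i] for i in cols], pruned


text = "en,es,fr\nred,rojo,rouge\nnetwork,red\nclose,,fermer\n"
header, entries = load_dictionary(text)
local_header, local_entries = keep_languages(header, entries, {"es"})
```

Merging a language back in would be the reverse operation, keyed on the canonical column.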

And even a pretty fragmented country, like Guatemala or Mexico, doesn't have
more than a few dozen languages, of which at most a handful are over your
cutoff of 100K. (Here, there are officially 25 languages, of which I think
about 5 beat 100K). I think that translation burden and quality concerns
become pretty unmanageable below some number in 50-100K. So: say I have 12
languages in my versions of the files (generous, I think). Say that each
identifier has an average of 10 versions. With your factor of 50*16*5 that
gives 40K of (easy) parsing (and memory use)... a noticeable, but not an
unmanageable, number. Still, you're right, indiscriminate importation (which
is a phase some new programmers go through) could bring the OLPC to its
knees as you opened Develop.
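
For anyone who wants to check both back-of-envelope numbers, here they are spelled out. All the inputs come from the figures quoted in this thread; only the variable names are mine.

```python
BYTES_PER_IDENTIFIER = 16           # average size assumed in the quoted estimate
PUBLIC_IDENTIFIERS_PER_MODULE = 50
IMPORTED_MODULES = 5


def dictionary_bytes(versions_per_identifier):
    """Bytes of dictionary to parse when opening a file with its imports."""
    return (BYTES_PER_IDENTIFIER
            * PUBLIC_IDENTIFIERS_PER_MODULE
            * IMPORTED_MODULES
            * versions_per_identifier)


# Half of the ~1200 languages with 100K+ speakers, all present in every file.
all_languages = dictionary_bytes(1200 // 2)

# A locally pruned copy averaging 10 versions per identifier.
local_subset = dictionary_bytes(10)
```

That gives 2,400,000 bytes for the everything case and 40,000 bytes for the pruned one.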

But I'm actually saving on "disk" (NAND or whatever you call it) space
versus your proposal, because I'm giving the same languages (I assume)
without repeating the English in every file.

>
> > I have found one case where you actually need to do a manual
> > disambig for every case of an identifier in your file. Say you are
> > importing an English color_module with the translation red -> rojo.
> > You're also importing an UNTRANSLATED spanish_networking_module
> > which uses the identifier red. Now you want to add the translation
> > network -> red to spanish_networking_module. If you were working in
> > Spanish, the editor could have warned you when you typed "red" and
> > helped you change it into "es___untranslated___red"
> >
> One possibility I thought about, which may help a bit: start with
> that. I.e. the intelligent editor (at least, though I realize we
> cannot rely on it) would write es___untranslated___red in the .py
> file in the first place.

Which .py? my_network_colors or spanish_networking_module? If you mean the
latter, yes, this is a good idea that I hadn't thought of, but it also
introduces bloat and makes a dumb editor harder to use for the file's
author.

> That means less ambiguity in the actual .py file, esp. if it is used
> later in a non-language aware editor.

Less ambiguity in my_network_colors, right?

> That said, the user will only see rojo and (Spanish) red, right?

Yes, if my_network_colors is in Spanish. But if it's in English, they'll see
red (colored as English), unless the editor has some reason to know that red
has been imported from spanish_networking_module.py. Hopefully, if they ever
mean the Spanish one, they'll notice it's colored as English and say "no, I
mean the one from s_n_m", at which point it will turn into
es___untranslated___red, and red will be added (without translation) to the
Spanish column of the dictionary for s_n_m. Adding the translation later will
be no problem.
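
The check the editor would need before it can warn about a case like red can be sketched roughly as follows. The data layout (each module reduced to a language code plus its on-disk identifiers) and both function names are assumptions for illustration; the tag format is the es___untranslated___red one from the example above.

```python
from collections import defaultdict


def ambiguous_spellings(modules):
    """Find on-disk spellings claimed by modules in different languages.

    modules: mapping of module name -> (language code, list of identifiers)
    Returns {spelling: sorted list of (module, language)} for each clash.
    """
    owners = defaultdict(set)
    for name, (lang, identifiers) in modules.items():
        for ident in identifiers:
            owners[ident].add((name, lang))
    return {
        ident: sorted(who)
        for ident, who in owners.items()
        if len({lang for _, lang in who}) > 1
    }


def tagged(lang, ident):
    """Spell an identifier with an explicit language tag (assumed format)."""
    return f"{lang}___untranslated___{ident}"


modules = {
    "color_module": ("en", ["red", "blue"]),
    "spanish_networking_module": ("es", ["red", "enviar"]),
}
clashes = ambiguous_spellings(modules)
```

Here clashes has a single entry, for red, and the editor could offer tagged("es", "red") as the unambiguous spelling.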

Cheers,
Jameson

ps...

> so-called "arabic" numerals (westernized
> indic!)

 Indic? That's a new one on me!