[sugar] Develop i18n design (was Re: Develop activity (Oops...))

Tue Aug 14 22:08:50 EDT 2007

First, thanks for taking the time to respond to my suggestions. As  
you tactfully pointed out, you are doing the coding; and code beats  
ideas, and beats ideals all the more so.

That said, I still hope to persuade you that the more complex design  
I outlined is desirable, and maybe even realistic ;-)

First, your objections:

> I originally intended a design something like this. Here are the  
> problems I found:
> 1. requires an assumption that all those who edit a given module  
> have our magic editor,

Not necessarily. Someone using a text editor would not tag their  
identifiers, which will initially be assumed to be in English;
and someone with the magic editor looking over that code might later  
be able to tag some of those (oh, look, here's a Qeqchi word! let's  
retag...)

> and that they all have their "preferred language" set correctly.  
> (Imagine a Belizean child who left it set on their country's  
> default "English" but actually edited in Spanish, Garifuna, Creole,  
> or Qeqchi).

a) Not that likely an issue in our use case, IMHO. The child is
assumed to understand only one language well (say Garifuna);
otherwise what is the point? He or she would set up the UI in that
language, and the editor would pick up that UI preference.
b) Still, you have a point; so the magic editor should allow post-hoc  
identifier re-categorization.
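
To make the retagging idea concrete, here is a minimal sketch. The
Identifier class, the retag helper and the sample names are all
hypothetical, and I assume untagged identifiers default to English:

```python
from dataclasses import dataclass

DEFAULT_LANG = "en"  # assumption: untagged identifiers are treated as English


@dataclass
class Identifier:
    name: str
    lang: str = DEFAULT_LANG  # language tag; English when nothing was tagged


def retag(identifiers, name, new_lang):
    """Return a new list where every identifier called `name` carries `new_lang`."""
    return [
        Identifier(i.name, new_lang) if i.name == name else i
        for i in identifiers
    ]


# A file written in a plain text editor: nothing is tagged yet.
idents = [Identifier("count"), Identifier("palabra"), Identifier("total")]

# Later, someone with the language-aware editor spots a Spanish word.
idents = retag(idents, "palabra", "es")
```

The point is only that the tag is data kept beside the identifier, so
it can be corrected after the fact without touching the code.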

> 2. It also means that the dictionary for a given module could get  
> filled up with translations of each function's internal variables, etc.

I see that as a desired feature. Code is always its own best  
documentation; reading code from experienced coders is the best  
learning strategy. So being able to display globally translated code  
feels necessary to me.

> 3. I see no way of parsing a file to see what is "public" (for  
> importing) and what is "private" (module internal). Thus the entire  
> dictionary for a module would have to be imported with the module.

I did not think about how to separate them; I would indeed tend to
read the entire dictionary, which I admit may incur a real memory
cost (which is why I am interested in leveraging gettext, etc.).
However, nothing forces anyone to translate internal identifiers
(though I still believe it's a good idea!).

> 4. For similar reasons, modules would have to import the  
> dictionaries for modules 2 levels up in the import inheritance.

You mean for the case where the module name gets translated? Yes, I  
would expect to read the package translation files for that.
There should also be one file for top-level packages.

> My paradigm lets you manually do this only for the rare cases it's  
> needed.

I am not sure how your paradigm solves the issue of translated module  
names; please expand.

One thing I wonder about, from some of your remarks: If module X  
imports module Y, do you expect translations of Y identifiers (and  
the name 'Y' itself) to be found in the X translation file? I reread  
you and it seems not, but I'd like to make absolutely sure.
Related question: Do you expect translation to carry between modules?  
Eg would translating file.close as fichier.fermer have an impact on  
anything named 'close' in any module (such as StringIO)?
I admit I did not think so initially: translations were
module-specific in my mind. But I am now reconsidering; my initial
position certainly makes duck typing unintelligible. But applying  
translations broadly may be an issue in some cases, and introduce  
more ambiguity. I will think more about this, and I am curious about  
your thoughts. (I am mostly curious to see if this is how you saw  
things.)
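
To pin down the two options I am weighing, here is a rough sketch. The
TRANSLATIONS table and both function names are made up for
illustration; neither is taken from your translator module:

```python
# Hypothetical per-module French dictionaries: {module: {english: french}}
TRANSLATIONS = {
    "file": {"close": "fermer"},
    "StringIO": {},  # nothing translated yet
}


def translate_local(module, name):
    """Module-specific policy: only the module's own dictionary applies."""
    return TRANSLATIONS.get(module, {}).get(name, name)


def translate_global(module, name):
    """Broad policy: fall back to any module that translates the same name."""
    local = TRANSLATIONS.get(module, {})
    if name in local:
        return local[name]
    for other in TRANSLATIONS.values():
        if name in other:
            return other[name]
    return name


print(translate_local("StringIO", "close"))   # stays in English
print(translate_global("StringIO", "close"))  # borrows file's translation
```

The broad policy keeps duck-typed code readable (every close shows up
as fermer), but it takes the first match it finds, which is exactly
where the extra ambiguity would come from.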

> If you have a good solution for the problems I mention above, I'd  
> be happy to consider it. As it is, I'm not saying it's impossible,  
> but my feeling as I tried to code it initially was that if you try  
> to make things too "magic", eventually you're going to make the  
> wrong assumption and you're going to create some bugs in the code  
> being edited that are really really hard to track down.

I do not believe I introduce so much magic; so I am wondering whether  
there is a misunderstanding in some key assumptions that make what  
you attempted very different from what I am trying to explain.
But then, I must admit I did not try coding either!

> I'd rather keep things "simple" (my current version of the  
> translator module is over 200 LoC, by the time I add file IO for  
> the translation dictionaries and clean things up, I expect it will  
> be around 300 plus docstrings) so that it is at least possible for  
> a programmer who hasn't studied the code to have an intuitive grasp  
> of what's going on "under the hood" in their translator.

Agreed that simplicity is a worthy design goal.

>  And what is the benefit? Modules which are a "language soup",  
> maintained by an international coalition of children who can't even  
> talk to one another.
>
> I support the idea of international, cross-language programming  
> collaboration. However, I think that a basic assumption would be  
> that there were at least one common language of communication  
> between the participants, and that any given module has an owner  
> and thus a preferred language for all its still-untranslated  
> identifiers. If somebody wanted to "add to" that module, they'd  
> have to either write in that preferred language or use an explicit  
> import and subclass (or, of course, just do it in their language  
> for their own private use, because that file would then be  
> explicitly marked as "messy and suggested not for sharing")...

You are quite right, this is the crux of the argument: are there real  
use cases for it?

First, I agree with you that children who can read nothing about a  
module cannot use it; for example, we assume that the tool is useful  
to non-english speakers insofar as a translation is provided in a  
language they understand.
So we can assume that if people are collaborating on a module, there  
are common languages. Not necessarily one, however: if I write a
module in French, a Belgian can provide a Flemish translation, which a
South African should be able to decipher and edit in Zulu. This does
not require me to understand Zulu; and, more importantly, it does not
require the South African to read French. I think that this would be  
a realistic benefit.
It does, however, require the Belgian kid to keep busy translating if  
I am to re-use the code from the Zulu-speaking child! How likely is  
that, realistically? You may be right, and that process may be
one-way only. Yet again... In my view, it depends on the difficulty or  
ease of making and distributing translations (which is why I was  
thinking of distributed databases...)
Also, it depends on language community size... Suppose the module is  
written in English, and two people maintain a Spanish and a Russian  
translation independently... it is quite believable for improvements  
to be made in either language that could not be read by the other  
community; yet English as the hub makes it likely that they will see  
one another's improvements.

Let me summarize: my use case for doing this could be stronger; but I  
would argue it does exist, and for my part I am not yet convinced  
that identifier-specific language tags present such difficulties as  
you suggest. Again, I would like to understand better what problems  
you ran into (other than those above, which I think I mostly  
answered, save point 4.)

>> Another quick related note: What if someone adds a translation
>> between two non-English languages?
> I think that in this case, we could make an explicit English  
> placeholder, along the lines of "fr_une_fonction_in_English". Then  
> when the English is added later, our magic can clean things up.

My intent exactly ;-)
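
For what it's worth, a sketch of how such placeholders might be built
and later cleaned up. The exact shape (<lang>_<name>_in_English) is my
guess from your single example, and the helper names and the
dictionary layout are hypothetical:

```python
import re

# Assumed placeholder shape, modelled on "fr_une_fonction_in_English":
# <lang>_<original identifier>_in_English
PLACEHOLDER = re.compile(r"^(?P<lang>[a-z]{2,3})_(?P<name>.+)_in_English$")


def make_placeholder(lang, name):
    """Build the stand-in English key for a word with no English yet."""
    return f"{lang}_{name}_in_English"


def parse_placeholder(english_key):
    """Return (lang, name) for a placeholder key, or None for a real word."""
    m = PLACEHOLDER.match(english_key)
    return (m.group("lang"), m.group("name")) if m else None


def fill_in_english(dictionary, lang, name, english):
    """Swap a placeholder key for the real English word once it is known."""
    key = make_placeholder(lang, name)
    if key in dictionary:
        dictionary[english] = dictionary.pop(key)
    return dictionary


# A French -> Flemish translation added before any English exists.
words = {make_placeholder("fr", "une_fonction"):
         {"fr": "une_fonction", "nl": "een_functie"}}
fill_in_english(words, "fr", "une_fonction", "a_function")
```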

>  The "from the context of" is more a UI than a logical consideration.

That makes sense. Thank you for the clarification.

>> a) I believe there should be one translation file per language.  
>> More file pollution, less parsing.
>
> My current mockup is going to use csv for its dictionary file  
> format - English in the first column and one language per additional  
> column, in order of addition. Internally it is just a 2d array - a  
> list of lists. A "row" is only as long as it has to be and can have  
> any number of "empty" values as long as at least one value is  
> defined. The parsing on this file is, needless to say, trivial. It  
> would not be much harder if you converted it to XML, with one word  
> in one language per line, and you'd get more-fine-grained diffs for  
> free.

Hmmm... Here, allow me to insist. You are mentioning Garifuna, so I  
assume you are interested (as I am) in languages with a comparatively  
small population base. According to Ethnologue, there are > 6900  
living languages, of which >1200 are spoken by populations of 100K or  
more. Assume only half of that is represented; assume identifiers are  
16 bytes on average (not generous at all: multibyte in utf-8 often  
costs double, when not triple!); assume (round number) 50 public  
identifiers per module; assume 5 modules are imported: we're well
over 2 MB worth of parsing (600 x 16 x 50 x 5 = 2.4 million bytes)...
I honestly think this is worth avoiding.
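
Spelled out, the back-of-the-envelope estimate looks like this (all
figures are the assumptions stated above, not measurements):

```python
languages = 1200 // 2        # half of the >1200 languages with 100K+ speakers
bytes_per_identifier = 16    # assumed average size in UTF-8
identifiers_per_module = 50  # public identifiers only
modules_imported = 5

# Single multilanguage dictionary: every language is parsed on import.
total_bytes = (languages * bytes_per_identifier
               * identifiers_per_module * modules_imported)

# One file per language: only the reader's own language is parsed.
per_language_bytes = total_bytes // languages

print(total_bytes)         # 2400000
print(per_language_bytes)  # 4000
```

With one file per language, the same import costs about 4 KB of
parsing instead of 2.4 MB.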

> OK, that is a valid point, it might be nice to follow getinfo  
> format to leverage existing tools. However, it is very useful in my  
> model that once a word exists in the dict for any language, it is  
> noticeable from the perspective of all languages, even if it has not  
> been translated at all yet. This is a benefit of a single  
> multilanguage dict.

In my view, the list of all words is readily available in the module  
itself.
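
As an illustration of what I mean, the word list can be pulled
straight out of the source. This sketch assumes a present-day Python 3
with the standard ast module; identifiers_in is a hypothetical helper:

```python
import ast


def identifiers_in(source):
    """Collect the names defined or used in a piece of Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            names.add(node.id)        # variables read or assigned
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                               ast.ClassDef)):
            names.add(node.name)      # function and class names
        elif isinstance(node, ast.arg):
            names.add(node.arg)       # parameter names
        elif isinstance(node, ast.Attribute):
            names.add(node.attr)      # attribute names such as obj.close
    return names


source = """
def area(width, height):
    result = width * height
    return result
"""
print(sorted(identifiers_in(source)))  # ['area', 'height', 'result', 'width']
```

Any word in that set which is missing from a given language file is,
by definition, still untranslated in that language.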

> I've been thinking about it. The problem, as you point out, is  
> hygiene; you need to build some whole new ratings/trust model (and  
> UI) for conflicting translations. Still, I think that if you make a  
> clean design for one person, adding this functionality later should  
> actually not be that sci-fi/hard. This would certainly give a  
> "critical mass" to the concept!

Rating systems are something I'm interested in. I will try to think  
of a design.

> I have already implemented display-only disambiguation (and  
> reambiguation - consider when one English word has different French  
> translations in moduleX and moduleY, you need to call it  
> moduleX___voir___moduleY___vu...

Oh... is re-ambiguation how you handle duck typing?
OK... here lie dragons indeed. Still, very nice.
This convinces me you are using module-local translation, which
reassures me (less magic).
But your solution, though elegant, feels hard to read... May I  
propose an alternative?
Pick one of the modules as canonical, and present the imports as follows:
    from moduleX import voir
    from moduleY import vu as voir
(this would be display-only.)
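
A sketch of how the editor might generate those display lines. The
display_imports helper is hypothetical, and its input is simply each
module's own translation of the one shared English identifier:

```python
def display_imports(translations, canonical):
    """Render display-only import lines for one shared English identifier.

    translations maps module name -> that module's translation of the word.
    The canonical module's wording is the one shown in the code body.
    """
    shown = translations[canonical]
    lines = []
    for module, local in translations.items():
        if local == shown:
            lines.append(f"from {module} import {local}")
        else:
            # A different wording is aliased to the canonical one.
            lines.append(f"from {module} import {local} as {shown}")
    return lines


for line in display_imports({"moduleX": "voir", "moduleY": "vu"}, "moduleX"):
    print(line)
```

Nothing here touches the .py file; it only decides what the reader
sees at the top of the module.
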
> I have found one case where you actually need to do a manual  
> disambig for every case of an identifier in your file. Say you are  
> importing an English color_module with the translation red -> rojo.  
> You're also importing an UNTRANSLATED spanish_networking_module  
> which uses the identifier red. Now you want to add the translation  
> network -> red to spanish_networking_module. If you were working in  
> Spanish, the editor could have warned you when you typed "red" and  
> helped you change it into "es___untranslated___red"
>
One possibility I thought about, which may help a bit: start with  
that. I.e. the intelligent editor (at least, though I realize we  
cannot rely on it) would write es___untranslated___red in the .py  
file in the first place.
That means less ambiguity in the actual .py file, esp. if it is used  
later in a non-language aware editor.
That said, the user will only see rojo and (Spanish) red, right? No
ambiguity there. The problem is if the user uses an untranslated
English red alongside the new Spanish red. Color cues become
essential; also, the editor, as you point out, should beg that the
English red be translated ASAP.
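
Roughly, what I have in mind for the on-disk form versus the displayed
form. The tag/untag/display helpers are hypothetical, and the
bracketed suffix is only a stand-in for a color cue:

```python
SEP = "___"
MARKER = "untranslated"


def tag(lang, name):
    """Build the on-disk form, e.g. es___untranslated___red."""
    return SEP.join((lang, MARKER, name))


def untag(stored):
    """Split a stored identifier into (lang, name); untagged means English."""
    parts = stored.split(SEP)
    if len(parts) == 3 and parts[1] == MARKER:
        return parts[0], parts[2]
    return "en", stored


def display(stored, ui_lang):
    """Show the bare name, flagging it when its language is not the UI's."""
    lang, name = untag(stored)
    return name if lang == ui_lang else f"{name} [{lang}]"


print(display(tag("es", "red"), "es"))  # the Spanish red, shown plainly
print(display("red", "es"))             # the English red, shown flagged
```

A non-language-aware editor would show the long form, which is ugly
but unambiguous.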

> I was unaware that there was any place in the world where first  
> numeracy was not in arabic numerals? I know that there are numeral  
> characters in Asian languages, but I thought that math was taught  
> in Arabic numerals even there - just as people here in Guatemala  
> learn base-20 Mayan numerals but don't use them day-to-day, even a  
> native Mayan speaker who doesn't speak Spanish will only speak in  
> Mayan numerals up to at most 19, beyond that they use Spanish with  
> Mayanized phonetics.
>
> Is this important and worth doing? Shouldn't be too hard if it is.

I cannot vouch for it either way. I am quite sure that there are  
regions where local numerals are taught first; but like you, I doubt  
they have not encountered so-called "arabic" numerals (westernized  
indic!) by the time they can use a computer. Still, I was flagging it  
as something worth researching, precisely.

Cheers,
Marc-Antoine