[Localization] slightly long and detailed proposal for documentation-translation workflow

Mon Oct 15 17:20:30 EDT 2007

I sent these ideas to Jim Gettys, who suggested that I send them to
the development and localization mailing lists.

------
Summary:

    * Write and edit primary documentation according to an explicit set
of writing conventions designed to minimize ambiguity and complexity
in order to facilitate translation.
    * Treat this English documentation as source code which is meant
to be translated/compiled into user languages.
    * Use or create collaboration tools to make translation,
distribution, and maintenance of docs more efficient.

------
Assumptions:

Some of those doing translation will not be professional translators
fully bilingual in English and the target language. They might be any
of the following:

    * a village teacher who speaks the target language as her first
language (L1) and English as a weak second language (L2);
    * a missionary who speaks English as L1 or L2 (in the case of a
French missionary in Africa, for example) and the target language as a
weak L3;
    * a professional translator who speaks a non-English L1, reads and
writes the target language as L2, and knows English as just a subject
that he or she studied in school and uses for travel;
    * a native L1 speaker of the target language who has immigrated to
a foreign country in which English is spoken as a primary or secondary
language.

Many of the translators are not going to be career translators, so
rather than having the translator accommodate the source text, the
source text should accommodate the translator.

Documentation translation is particularly difficult because of how
documentation is usually created. Often docs are written grudgingly at
the end of the project, and docs are rarely written to a uniform
format or set of conventions. There is little reflection on what kind
of docs are needed, and docs are usually not edited before they are
sent off for translation and publishing. The conventional approach,
when a novel or academic article is translated, is that the translator
bears the burden of accommodating the original. If the original is
unclear, this lack of clarity is carried into the target text, because
the target text must be a mirror of the original.
I know this from direct experience, having been the translator for
many doc jobs from Japanese companies. The originals are often
incomprehensible because of ambiguity and inconsistency, as in the
following examples:

    * different sections of the docs are written by different people
using different terminology for the same processes and entities;
    * unconfident writers are too brief, assuming background info and
context to which the translator does not have access;
    * more confident writers use too many idioms and colorful
expressions, rambling on and on in extended and poorly-organized
complex sentences;
    * section divisions and overall organization are inconsistent,
forcing the translator to restructure the original before beginning
the translation;
    * ambiguities inherent in the language itself (like the frequent
omission of gendered pronouns and explicit sentence subjects in Japanese) also
complicate the translation, forcing the translator to contact the
writer of the original, thus slowing the process and degrading
translator motivation and confidence.

Ambiguity is the biggest obstacle to translation. If it is a rush job
(and it always is), and especially if the translation is being handled
by a middleman like a publisher or web design firm (and these days it
almost always is), the translator usually retreats to literal
translation in the face of ambiguity because there is no way to
contact the author (middlemen don't want the translator to know how
much the client is being billed for translation) or no time to wait
for the reply. When the text is unclear, the translator has no choice
but to translate the ambiguity itself. In the case of OLPC
documentation, ambiguity should be avoided at all costs. Anything that
interferes with teachers and students using the notebooks should be
avoided, and bad docs would certainly be frustrating and demotivating
for the educators and pupils. In order to have translations that are
as clear as possible, we must have source-docs that are as clear as
possible.

------
Reconception of documentation/translation as parallel to computer programming:

The OLPC team uses English as a common working language, but the users
will be using translations, so the English documentation can be seen
as not a product in and of itself but as the source for all
translations. The English-language "source docs" should be written to
a set of conventions meant to reduce ambiguity and ensure consistency,
even when doing so necessitates violating conventional English writing
style. The set of documentation standards I am proposing is similar to
the set of coding conventions a programmer follows. The "source docs"
(though written in English) should be seen as source code which is
then compiled (or translated) into the many languages needed to
support the users. Likewise, the source-docs should include explicit
comments and extra-textual blocks to clarify ambiguity introduced by
the writing style or inherent in the language itself, much in the same
way that a good programmer includes comments in source code to
compensate for the lack of explanatory devices in the code itself.
Looping through a multidimensional array doesn't tell you WHY you need to do so
or how it plays into the next code block, just as being told that the
subject of a sentence is "Suzuki-san" does not tell you if Suzuki is a
"she" or a "he". Most techs have had the experience of having to
maintain a code base which did not include sufficient comments: while
"read the friendly code" or "use the source" might be good ways to
learn to program, this kind of detective work is not an efficient use
of time and effort.
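
To make the "comments in source-docs" idea concrete, here is a minimal
sketch in Python. The [[note: ...]] marker syntax and the function names
are my own assumptions for illustration, not an existing OLPC convention
or tool. The translator sees the notes; the published text has them
stripped out, much as a compiler discards comments.

```python
import re

# Hypothetical marker syntax (an assumption, not an existing standard):
# [[note: ...]] holds a comment meant for translators only.
NOTE_PATTERN = re.compile(r"\s*\[\[note:\s*(.*?)\]\]")


def extract_notes(source_text):
    """Return the translator notes found in the source text, in order."""
    return [match.group(1).strip() for match in NOTE_PATTERN.finditer(source_text)]


def strip_notes(source_text):
    """Return the text as readers would see it, with the notes removed."""
    return NOTE_PATTERN.sub("", source_text)


source = "Suzuki-san [[note: Suzuki is a woman]] will explain the keyboard."

print(extract_notes(source))
print(strip_notes(source))
```

The note travels with the sentence it clarifies, so the translator does
not have to contact the writer to resolve the ambiguity.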

------
Doc writing conventions:

Some linguistic research has been done on "simplified English" as a
subset of English to use for low-level learners, and I think that it
might be a good place to look for ways to simplify the source-docs.
But just thinking intuitively, I have cooked up the following
suggestions in order to generate discussion:

    * Pronouns.
          o Use the first-person singular pronoun "I" to represent the
author of the docs,
          o the second-person singular pronoun "you" to represent the
reader of the docs, and
          o the first-person plural pronoun "we" to represent the OLPC project.
          o Examples. "We have designed a screen that switches to
black-and-white to conserve energy. I will explain how to switch your
screen to black-and-white. First, you press the X button on your
keyboard...." Because we want the docs to be easily translated and
easily understood, the tone should be personal, using "I" for the
voice of the writer. This will be easier for amateur translators to
translate and easier for younger readers to understand. This will also
help the writer avoid the passive construction, which is very
difficult for some non-native English speakers to understand.
    * Lists.
          o Use tables to explain parallel relationships, comparisons,
the composition of an entity, and categorical relationships.
          o Use numbered lists to explain the stages of a process, the
steps in a sequence, or anything that has an inherent spatial or
temporal order or expresses precedence. Do not use numbered lists if
the numbers do not relate to some inherent property of the items. A
grocery list should not be numbered, unless the order in which the
items are purchased is important.
          o Use bulleted lists for lists that do not have inherent
order or precedence. The grocery list would be bulleted.
    * All comma sequences should have a comma before the last
conjunction, e.g. "I like to read books, eat shrimp, and run
marathons," rather than, "I like to read books, eat shrimp and run
marathons." It is fashionable right now to leave out the last comma,
but doing so puts the onus of comprehension on the reader. While this
is a nit-picky detail, OLPC source-docs should do as much of the work
as possible so that translation and comprehension are as easy as
possible.
    * Use parentheses to include supplemental information like the
gender of human agents, steps in a sequence, the target of a pronoun,
etc. when there is any ambiguity.
    * Many languages, including Japanese, represent non-native names
in a native writing system. In Japanese, foreign names are written in
a phonetic script called katakana, and my name is pronounced Kuupaa
Maikeru. The result is that there is a loss of data; the orthography
of my name (the spelling in English) is lost to any
Japanese-to-English translator, as is the proper pronunciation. I
suggest that all source-docs have personal names written in the Latin
alphabet and followed in parentheses by the pronunciation written in
IPA (International Phonetic Alphabet). Then
translators should be told to always put the original orthography in
parentheses after the name that they are using, so that my name would
be "<katakana>Kuupaa Maikeru</katakana> (<alpha>Micheal
Cooper</alpha>)" in a Japanese translation.
    * Insert a table that acts as a glossary of terms and their
definitions at the beginning of each text. These would be the key
nouns and verbs used in the text, terms that need to have clear
meanings and consistent translations. The translators would be
required to keep cumulative lists of these key terms in OpenOffice
Calc or a similar tool so that, if the translator changes or a group
of translators is doing the job, the key terms can be kept consistent. If
we know ahead of time that there will be translator teams, this could
be covered by a webapp or by Google spreadsheets.
    * Idioms and culture-specific metaphors and references should be
avoided or used sparingly. Of course, terminology that originated in
cultural metaphor, like "kill a process" and "reboot the server" would
be treated as key terms and added to the glossary to be translated
consistently, but more creative and expressive language ("you can type
like a banshee", "students will be on it like white on rice",
"resulting in a Mickey Mouse, vanilla solution to the problem") should
be curtailed.
    * Use words, mathematical symbols, and visuals to reinforce and
enhance purely verbal explanations with conceptual representations of
information (I am thinking Edward Tufte here), e.g. (poor example, but
here goes) "I will show you how to teach your students to create
multimedia presentations. <in box> Sound + Pictures = Multimedia </in
box>." I think you get the idea, though.
    * The source-docs should be organized so that each section and each
paragraph is identified by a number, and the translators should be
required to maintain this organization so that paragraph 61 in the
Yoruba translation is paragraph 61 in the source-docs. By doing so, it
will be easier to modify the translations when changes are made to the
source-docs. This would imply some kind of web-based app to store and
manage the docs. I am looking at the way we translate in my
organization and thinking about what would be a good online tool to
coordinate translations. There are many proprietary tools with vast
hordes of features and complications which cost 1-2 thousand dollars
per user, but they are not suitable for OLPC. I think OLPC docs-trans
would do well with a lighter, simpler application. If the list doesn't
mind, I would like to post the resulting thoughts at a later date so
that there can be an exchange of ideas.
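
The paragraph-numbering suggestion above could be sketched roughly as
follows. This is only an illustration in Python under my own
assumptions: the data layout (paragraph number mapped to a revision
counter and text), the function name, and the placeholder strings are
all hypothetical, not part of any existing tool. The point is that once
paragraph 61 means the same thing in every language, finding what needs
retranslation is a simple lookup.

```python
def find_stale_paragraphs(source, translation):
    """Compare a translation against the source-docs by paragraph number.

    source:      dict of paragraph number -> (source revision, text)
    translation: dict of paragraph number -> (source revision that was
                 translated, translated text)
    """
    # Paragraphs that exist in the source but have no translation yet.
    missing = sorted(n for n in source if n not in translation)
    # Paragraphs translated from an older revision of the source.
    outdated = sorted(
        n for n in source
        if n in translation and translation[n][0] < source[n][0]
    )
    # Paragraphs in the translation that no longer exist in the source.
    orphaned = sorted(n for n in translation if n not in source)
    return {"missing": missing, "outdated": outdated, "orphaned": orphaned}


# Placeholder texts only; the revision counters are made up for the example.
source_docs = {
    60: (1, "<source paragraph 60>"),
    61: (2, "<source paragraph 61, edited once>"),
    62: (1, "<source paragraph 62, newly added>"),
}
yoruba_docs = {
    60: (1, "<Yoruba paragraph 60>"),
    61: (1, "<Yoruba paragraph 61, from the first revision>"),
}

report = find_stale_paragraphs(source_docs, yoruba_docs)
print(report)
```

In this example, paragraph 61 is reported as outdated and paragraph 62
as missing, so the translator knows exactly which paragraphs to revisit.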

I apologize for the length, and I hope these ideas can be of help.

Micheal Cooper, Japan