[Localization] eToys i18n analysis and suggestions

Alexander Dupuy alex.dupuy at mac.com
Fri Jun 17 15:27:31 EDT 2011


Hi Chris,


You wrote:

> This may seem trivial until you realize that (quit, Quit, QUIT and
> Quit:) all count as different strings to Pootle and to poterminology.
> This can multiply the number of strings for a localizer to translate
> and result in missed opportunities to include a common term in the
> Terminology project.


Actually, poterminology does not have to treat capitalization strictly,
and by default it merges some capitalization variants.  Three
command-line options control this:


> -F, --fold-titlecase    fold “Title Case” to lowercase (default)
> -C, --preserve-case     preserve all uppercase/lowercase
> -I, --ignore-case       make all terms lowercase

The default --fold-titlecase should convert "Quit" to "quit"
automatically; "QUIT" will remain unchanged (as it might be an acronym
with a different meaning).  Furthermore, terminology generation
automatically strips all leading and trailing punctuation, such as the
colon (:) in your example of Quit:, so the strings quit, Quit, and Quit:
will all be analyzed and output as quit by poterminology (even when they
appear in phrases of several words).  The title-case folding is done on
words split by whitespace, so it will not fold a capitalized hyphenated
word such as "Stop-gap" (although it would fold "Stop-Gap" to "stop-gap").
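
As a rough illustration of the folding rules described above, here is a
minimal Python sketch.  It is not poterminology's actual code: the
function name fold_term and the mode strings are hypothetical, and it
assumes that Python's str.istitle() is a reasonable stand-in for the
"Title Case" test and that punctuation is stripped word by word.

```python
import string


def fold_term(text, mode="fold-titlecase"):
    """Normalize a candidate term (hypothetical helper, for illustration).

    mode is one of "fold-titlecase", "preserve-case" or "ignore-case",
    named after the three command-line options quoted above.
    """
    words = []
    for word in text.split():  # split on whitespace only
        # Assumption: strip leading/trailing punctuation from each word.
        word = word.strip(string.punctuation)
        if not word:
            continue
        if mode == "ignore-case":
            word = word.lower()
        elif mode == "fold-titlecase" and word.istitle():
            # "Quit" and "Stop-Gap" are title case; "QUIT" and
            # "Stop-gap" are not, so they are left untouched.
            word = word.lower()
        words.append(word)
    return " ".join(words)


if __name__ == "__main__":
    for sample in ("quit", "Quit", "Quit:", "QUIT", "Stop-gap", "Stop-Gap"):
        print(repr(sample), "->", repr(fold_term(sample)))
```

With the default mode, quit, Quit and Quit: all come out as quit, while
QUIT and Stop-gap pass through unchanged.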


I know that Sayamindu initially ran poterminology with --preserve-case
to generate the current Sugar terminology project, but I think that was
the wrong choice.  Capitalization is important in actual strings, but
rarely if ever when suggesting terminology.  I would even suggest using
--ignore-case, as there are not many acronyms used in Sugar (eToys might
be different).


In any case, Pootle, when using a terminology project, ignores case
entirely and will suggest all matching terminology regardless of upper
or lower case.  It will also suggest substring matches, which can be a
bit misleading: e.g. 'top: arriba; encima ["on top"]' is suggested for
"stop" (along with 'stop: parar').  Occasionally this is helpful: e.g.
'record: grabar' is suggested for "recording".  A stricter
"word-prefix-only" substring matching might be better, although that
would suppress suggestions for words with prefixes like de-, un-, dis-,
re-, et cetera.
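
To make the trade-off concrete, here is a small Python sketch that
compares plain substring matching with the stricter "word-prefix-only"
idea.  It is only an illustration, not Pootle's matching code: the
function names and the tiny TERMS dictionary are hypothetical, and
"unrecorded" is just a sample word with an un- prefix.

```python
import re

# Hypothetical terminology entries, taken from the examples above.
TERMS = {
    "top": "arriba; encima",
    "stop": "parar",
    "record": "grabar",
}


def suggest_substring(source, terms=TERMS):
    """Case-insensitive substring matching (the behaviour described above)."""
    text = source.lower()
    return sorted(term for term in terms if term.lower() in text)


def suggest_word_prefix(source, terms=TERMS):
    """Stricter variant: a term must match the start of some word."""
    words = re.findall(r"\w+", source.lower())
    return sorted(
        term for term in terms
        if any(word.startswith(term.lower()) for word in words)
    )


if __name__ == "__main__":
    for source in ("Stop", "recording", "unrecorded"):
        print(source, suggest_substring(source), suggest_word_prefix(source))
```

In this sketch the prefix-only variant no longer offers "top" for
"Stop" and still offers "record" for "recording", but it also drops
"record" for "unrecorded", which is the cost mentioned above.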


Having said this, I suspect that many of your "merge" suggestions for
the eToys strings may still be worthwhile - harmonizing the
capitalization and punctuation can still reduce the string count and
improve other sources of suggestions, like Translation Memory, which may
not be as case-insensitive as terminology.



@alex

-- 
mailto:alex.dupuy at mac.com


