[Localization] Revamping the glossary

Alexander Dupuy alex.dupuy at mac.com
Tue Feb 26 18:00:55 EST 2008


I'm finally getting back to this after a week (and then some).

Re-reviewing the glossary.pot that Sayamindu attached, I think I 
understand now why both Edward Cherlin and I thought that it was the 
Record activity POT.  You had said you would upload the script used to 
generate this file after some refinement, but we haven't seen it yet, 
so I can only guess that it extracts all messages that appear in more 
than one POT file (or more than once in any file?), possibly with some 
filtering for "short phrases" along the lines suggested in 
http://translate.sourceforge.net/wiki/toolkit/creating_a_terminology_list_from_your_existing_translations 
and implemented in translate-toolkit/tools/poglossary.
Since all the messages at the top appear in the Record activity (and 
only a few messages, towards the end, do not), it seemed to us that 
this was all there was, but we were mistaken.

By using only complete phrases as they appear in other POT files, this 
approach provides as much context as possible, but it doesn't fully 
capture the common terminology present in the existing POT files.  
Terms such as "Activity" or "Mesh Network", which appear in multiple 
messages in multiple POT files, are missing from your attached 
glossary.pot because no message consists of just these words.
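
To make this concrete, here is a rough Python sketch of what I am 
guessing the script does.  It is not the actual script: the function 
name shared_messages, the max_words cutoff and the sample msgids are 
all made up for illustration, and parsing the POT files is left out.

```python
from collections import Counter


def shared_messages(pots, max_words=3):
    """Return msgids found in at least two POT files that are short phrases.

    `pots` maps a file name to the list of msgid strings in that file
    (hypothetical input; real POT parsing is not shown here).
    """
    files_per_msgid = Counter()
    for msgids in pots.values():
        # Count each msgid once per file, so repeats inside one file
        # do not make it look shared.
        for msgid in set(msgids):
            files_per_msgid[msgid] += 1
    return sorted(
        msgid
        for msgid, count in files_per_msgid.items()
        if count >= 2 and len(msgid.split()) <= max_words
    )


# Invented sample data, not taken from the real POT files.
pots = {
    "record.pot": ["Play", "Stop", "Join the Activity", "Remove"],
    "chat.pot": ["Stop", "Join the Activity", "Mesh Network is down"],
    "sugar.pot": ["Remove", "Connect to Mesh Network"],
}
print(shared_messages(pots))
```

With this sample data, "Activity" and "Mesh Network" never show up in 
the result, even though each occurs in two files, because neither is a 
complete message on its own.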

There are cases where a single word isn't sufficient.  In ticket #6439 
that you mention (http://dev.laptop.org/ticket/6439), for example, 
having "zoom" in the glossary isn't helpful for Spanish, since 
completely different terms are used for "zoom in" ("acercarse") and 
"zoom out" ("alejarse").  Even so, I think that there is a real need 
for "mining" of terms that are common substrings of multiple messages.  
This is clearly a lot more work, and needs things like stop lists to 
exclude words like "a", "the", "this", etc., but I think it is worth 
the effort.  I'm actually willing to do some of the work on this: the 
translate-toolkit utilities seem to provide a useful base, since 
pocount.py is already doing the word break analysis and so on.
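
As a starting point for discussion, the "mining" could look roughly 
like the Python sketch below.  This is only an assumption about how it 
might work: it splits words with a plain regular expression rather than 
reusing anything from the translate-toolkit, and the names 
candidate_terms and STOP_WORDS, the tiny stop list and the sample 
msgids are all invented for illustration.

```python
import re
from collections import defaultdict

# Illustrative stop list only; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "this", "to", "is", "of"}


def candidate_terms(pots, max_len=2, min_files=2):
    """Return word n-grams (1..max_len words) seen in at least min_files files.

    `pots` maps a file name to the list of msgid strings in that file
    (hypothetical input; real POT parsing is not shown here).
    """
    files_for_term = defaultdict(set)
    for filename, msgids in pots.items():
        for msgid in msgids:
            words = re.findall(r"[a-z']+", msgid.lower())
            for n in range(1, max_len + 1):
                for i in range(len(words) - n + 1):
                    gram = words[i:i + n]
                    # Skip n-grams that begin or end with a stop word.
                    if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                        continue
                    files_for_term[" ".join(gram)].add(filename)
    return sorted(
        term for term, files in files_for_term.items()
        if len(files) >= min_files
    )


# Invented sample data, not taken from the real POT files.
pots = {
    "record.pot": ["Join the Activity", "Zoom in"],
    "chat.pot": ["Share this Activity", "Mesh Network is down"],
    "sugar.pot": ["Connect to Mesh Network", "Zoom out"],
}
print(candidate_terms(pots))
```

On this sample the candidates include "activity" and "mesh network", 
but also a bare "zoom", so the output would still need review by a 
person before anything goes into the glossary.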

@alex

-- 
mailto:alex.dupuy at mac.com


