[Localization] Revamping the glossary
Alexander Dupuy
alex.dupuy at mac.com
Tue Feb 26 18:00:55 EST 2008
I'm finally getting back to this after a week (and some).
Re-reviewing the glossary.pot that Sayamindu attached, I think I
understand now why both I and Edward Cherlin thought that it was the
Record activity POT. Without having seen the script used to generate
this file (you had said you would upload it after some refinement, but we
haven't seen it yet), I would guess that it extracts all messages that
appear in more than one POT file (or more than once in any file?),
possibly with some filtering for "short phrases" along the lines
suggested in
http://translate.sourceforge.net/wiki/toolkit/creating_a_terminology_list_from_your_existing_translations
and implemented in translate-toolkit/tools/poglossary.
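To make that guess concrete, here is a minimal sketch of the selection rule, assuming that "appears in more than one POT file" and a word-count cutoff for "short phrases" are the only criteria. It is not the actual script (which hasn't been posted) and it is not poglossary. The function name, the max_words parameter and the sample data are all made up for illustration, and the POT files are stood in for by plain lists of msgid strings.

```python
from collections import defaultdict

def shared_messages(pot_files, max_words=3):
    """Return msgids that occur in more than one file and are short.

    pot_files is a hypothetical stand-in for parsed POT files:
    a dict mapping a file name to the list of msgid strings in it.
    """
    seen_in = defaultdict(set)
    for name, msgids in pot_files.items():
        for msgid in msgids:
            seen_in[msgid].add(name)
    # Keep only whole messages shared by at least two files,
    # filtered to "short phrases" by a simple word count.
    return sorted(
        msgid for msgid, files in seen_in.items()
        if len(files) > 1 and len(msgid.split()) <= max_words
    )

# Made-up sample data, not taken from the real POT files.
pots = {
    "record.pot": ["Record", "Stop", "Save to Journal"],
    "paint.pot": ["Stop", "Keep", "Save to Journal"],
    "write.pot": ["Keep", "Zoom in", "This message is far too long to be a term"],
}
print(shared_messages(pots))
```

Note that a rule like this only ever returns whole messages, so a word that never stands alone as a message cannot make it into the result.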
Since all the messages at the top appear in the Record activity (only a
few messages, towards the end, do not), it seemed to us that this was
all there was, but we were mistaken.
By using only complete phrases as they appear in other POT files, this
approach provides as much context as possible, but it doesn't fully
capture the common terminology present in the existing POT files. Terms
such as "Activity" or "Mesh Network", which appear in multiple messages
in multiple POT files, are missing from your attached glossary.pot
because no message consists of just these words.
There are cases where a single word isn't sufficient. For example, in
the case of ticket #6439 that you mention
(http://dev.laptop.org/ticket/6439), having "zoom" in the glossary isn't
helpful for Spanish, since completely different terms are used for "zoom
in" ("acercarse") and "zoom out" ("alejarse"). Even so, I think that
there is a real need for "mining" of terms that are common substrings of
multiple messages. This is clearly a lot more work, and needs things
like stop lists to exclude words like "a", "the", "this", etc., but I
think it is worth the effort. I'm actually willing to do some of the
work on this; the translate-toolkit utilities seem to provide a useful
base, since pocount.py already does the word-break analysis and so on.
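As a rough illustration of the kind of mining meant here, the sketch below collects one- and two-word phrases from every message, drops any phrase that starts or ends with a stop word, and keeps the phrases found in at least two files. Everything in it is an assumption made for the example: the stop list, the function name, the max_len and min_files parameters, the regex used for word breaking, and the sample messages. It does not use translate-toolkit or pocount.py.

```python
import re
from collections import defaultdict

# Tiny illustrative stop list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "this", "to", "of"}

def candidate_terms(pot_files, max_len=2, min_files=2):
    """Return word n-grams (up to max_len words) seen in >= min_files files.

    pot_files is a hypothetical stand-in for parsed POT files:
    a dict mapping a file name to the list of msgid strings in it.
    """
    files_for = defaultdict(set)
    for name, msgids in pot_files.items():
        for msgid in msgids:
            # Crude word breaking: lowercase runs of letters and apostrophes.
            words = re.findall(r"[a-z']+", msgid.lower())
            for n in range(1, max_len + 1):
                for i in range(len(words) - n + 1):
                    gram = words[i:i + n]
                    # Skip phrases that begin or end with a stop word.
                    if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                        continue
                    files_for[" ".join(gram)].add(name)
    return sorted(t for t, files in files_for.items() if len(files) >= min_files)

# Made-up sample data, not taken from the real POT files.
pots = {
    "chat.pot": ["Share this activity", "Join the mesh network"],
    "browse.pot": ["Stop the activity", "No mesh network found"],
}
print(candidate_terms(pots))
```

Here "activity" and "mesh network" come out as candidates even though no message consists of just those words. The output still needs human review: it also lists "mesh" and "network" separately, and a translator would have to decide which of these are real terms.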
@alex
--
mailto:alex.dupuy at mac.com