[Localization] Revamping the glossary - poterminology

Alexander Dupuy alex.dupuy at mac.com
Tue Mar 18 16:22:49 EDT 2008


I previously sent this last week to localization at lists.laptop.org but it was 
bounced as too large, due to the sample glossary.  I'm re-sending it in two 
parts (and CC'ing the translate-toolkit developers list on the first part as well).

As background for those on the translate-devel list: the OLPC (One Laptop Per 
Child) project is using Pootle to localize applications (activities) for the XO 
children's laptop.  Currently we are using a manually created terminology 
glossary for Pootle, but it is very small and we wanted to create a larger 
terminology file.  I pointed Sayamindu at the translate.sourceforge.net wiki 
page on creating a glossary, and also ran the poglossary program on our 
source files, but found both approaches to be pretty lacking.

...

Sayamindu Dasgupta wrote:
>> Thanks for the pointer to the translate.sourceforge.net page. I think
>> that the translate project folks have already managed to achieve what
>> I was trying to do.
>> I have followed the steps outlined in that page - and I'm attaching
>> the output. Does the attached file look more comprehensive ?

I replied:
> It's certainly more comprehensive, but it goes a bit far the other 
> way, including a lot of terms that only appear once, and stuff that 
> not only doesn't need to be in the Pootle Terminology, but that 
> probably shouldn't even be translated.
> Did you use the poglossary bash script in translate-toolkit-1.1.0?  
> It seems to automate the steps in their documentation, so that you can 
> run it as a single script rather than a whole sequence of commands.
>
> However, this would appear to suffer from most of the same problems as 
> noted above, and a few others.  Running it on the Spanish PO files 
> downloaded from Pootle as packaging-es.zip pootle-es.zip 
> update1-es.zip xo_bundled-es.zip gives 787 msgids in 3621 lines (your 
> version has 605 msgids in 2467 lines).  So it's not really a panacea, 
> but does provide a convenient starting point for simple modifications.
>
> The word counting approach I described earlier is more complex, but 
> should provide better results (although it will require significantly 
> more coding).

Not quite two weeks later, I have a pretty mature version of the word
counting program I was thinking of, although it's written in Python
rather than Bash, and I used poconflicts.py as a starting point rather
than pocount.py or poglossary.  Running it over the 24 Spanish PO files
downloaded from Pootle as act_server-es.zip packaging-es.zip
pootle-es.zip terminology-es.zip update1-es.zip xo_bundled-es.zip (note
that this includes the original glossary.v.0.1.po) gives the following
results:

$ poterminology -F -S stoplist-en *.po new-glossary.po
processing 24 files...
[###########################################] 100%
3279 terms from 2953 units in 24 files
355 terms after thresholding
352 terms after subphrase reduction

$ time poterminology -F -S stoplist-en *.po ../etoys/etoys.po glossary-with-etoys.po
processing 25 files...
[###########################################] 100%
13295 terms from 6614 units in 25 files
628 terms after thresholding
625 terms after subphrase reduction

real    0m13.245s
user    0m13.144s
sys     0m0.092s
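
For readers who want a feel for the three counts reported above (terms,
terms after thresholding, terms after subphrase reduction), here is a
minimal sketch of that kind of pipeline.  It is not the code in
poterminology.py: the function name extract_terms, the tiny STOPWORDS
set, the default limits, and the exact subphrase rule (drop a term when a
longer kept term contains it and occurs equally often) are all
assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stop words; the real stoplist-en file has its own format.
STOPWORDS = {"the", "a", "an", "to", "of", "and", "in", "is"}


def extract_terms(messages, max_words=3, min_count=2):
    """Return (all counts, terms after thresholding, terms after reduction)."""
    counts = Counter()
    for text in messages:
        words = re.findall(r"[a-z]+", text.lower())
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                phrase = words[i:i + n]
                # Skip phrases that start or end with a stop word.
                if phrase[0] in STOPWORDS or phrase[-1] in STOPWORDS:
                    continue
                counts[" ".join(phrase)] += 1

    # Thresholding: keep only terms seen at least min_count times.
    kept = {term: c for term, c in counts.items() if c >= min_count}

    # Subphrase reduction (assumed rule): drop a term when a longer kept
    # term contains it as whole words and occurs exactly as often.
    reduced = {}
    for term, count in kept.items():
        padded = " " + term + " "
        covered = any(
            other != term
            and padded in " " + other + " "
            and kept[other] == count
            for other in kept
        )
        if not covered:
            reduced[term] = count
    return counts, kept, reduced
```

With the three messages "Start the Browse activity", "Stop the Browse
activity" and "Browse the web", thresholding keeps "browse", "activity"
and "browse activity"; the reduction step then drops "activity" because
it never occurs outside "browse activity".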

I've attached the generated Spanish glossary (without the etoys.po)
[NOTE - now attached to separate message to avoid length limits]
along with the poterminology and poterminology.py scripts.  They can be
installed in /usr/bin/ and
/usr/lib/python2.*/site-packages/translate/tools/ respectively, and are
intended for eventual incorporation into the translate-toolkit package
(I'm working with version 1.1.0).  Also attached is stoplist-en, a list of
stop-words (and regex patterns) for English that does a pretty good job of
eliminating junk such as numbers.  The format of this file is
not yet described in the source code or manual page, but a comment at
the beginning of the stoplist-en file describes it well enough
for someone to tinker with it.

While the results aren't perfect, they are pretty reasonable, and have a
couple of advantages over both the current glossary and the results of
poglossary or the equivalent manual steps:

* most (although not all) of the existing terminology glossary is
preserved; the entries dropped are obsolete (e.g. "Web activity" - this
is now "Browse activity") or only appear in one place (or not at all) in
the actual system POT/PO files

* terminology is extracted not only from short messages, but also from
much longer ones, like those in the activation server and the "stuffer
sheet"

* default sorting by frequency puts the most commonly used terms first,
which is helpful for incomplete translations of the terminology glossary
(people seem to start at the beginning and work forward) [other sorting
options are available in poterminology: dictionary/alphabetic and by
string length]

* only uses strings that are actually present in the source PO/POTs.
The current glossary has entries like "Paint (activity)", which Pootle
will not display for a string containing "paint" (without the
"(activity)").  A better way to mark multiple meanings for an English
term is in the translation (and, eventually, in the source code notes).
A good example of this in the new-glossary.po is the entry for "play",
reproduced here in full:

# (poterminology) connect-activity.po (1)
# (poterminology) TamTamSynthLab.activity.po (3)
# (poterminology) record-activity.po (1)
# (poterminology) TamTamMini.activity.po (2)
# (poterminology) memorize.po (1)
# (poterminology) TamTamJam.activity.po (2)
# (poterminology) TamTamEdit.activity.po (3)
# (poterminology) slider-puzzle.po (1)
#: constants.py
#: activity.py
#: Edit/EditToolbars.py
#: Mini/miniTamTamMain.py
#: mmm_modules/buddy_panel.py
#: SynthLab/SynthLabConstants.py
#: common/Resources/tooltips_en.py
#: /home/msgodoi/olpc/workspace/Memorize.activity/activity.py
#, fuzzy
msgid "play"
msgstr ""
"jugar {memorize.po}; reproducir {record-activity.po, TamTamEdit.activity.po, "
"TamTamJam.activity.po, TamTamMini.activity.po, TamTamSynthLab.activity.po}"
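
A conflict-marked msgstr such as the one above, where each distinct
translation is followed by the files it came from, could be assembled
roughly as follows.  This is only a sketch: merge_translations and its
input mapping are hypothetical names, and the sort order is a guess at
what produces the ordering shown, not a description of what
poterminology.py does internally.

```python
from collections import defaultdict


def merge_translations(per_file):
    """Combine per-file translations of one term.

    per_file maps a source file name to that file's translation.
    Returns (msgstr, is_conflict).
    """
    files_by_translation = defaultdict(list)
    for filename, translation in per_file.items():
        files_by_translation[translation].append(filename)

    if len(files_by_translation) == 1:
        # Every file agrees: use the translation unmarked.
        return next(iter(files_by_translation)), False

    parts = [
        "%s {%s}" % (translation, ", ".join(sorted(files, key=str.lower)))
        for translation, files in sorted(files_by_translation.items())
    ]
    # The conflict flag lets the caller mark the entry as fuzzy.
    return "; ".join(parts), True
```

Given "jugar" from memorize.po and "reproducir" from record-activity.po
and TamTamEdit.activity.po, this yields "jugar {memorize.po}; reproducir
{record-activity.po, TamTamEdit.activity.po}" with the conflict flag set.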

The default output of poterminology annotates conflicting translations with
the input .po files they came from.  In this case the verb "play" has
different translations for playing a game and for playing a recording or
musical piece.  The entry is marked as fuzzy, and a reviewer could edit the
translation to something like:

msgstr "jugar {play a game}; reproducir {play a recording}"

With a terminology translation like this, Pootle will display both valid
translations for any string containing the English word "play" (or
"player" or "played" or "foreplay" - but that's a Pootle bug/feature)
and the translator should be able to figure out which of the different
choices is most appropriate.
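
The "player"/"foreplay" behaviour is what you would expect from a plain
substring test.  The sketch below contrasts that with a whole-word test;
I am assuming Pootle's lookup behaves like the first function (that is
what the matches above suggest), and both function names are made up for
this example.

```python
import re


def substring_match(term, text):
    """Plain case-insensitive substring test (assumed Pootle-like)."""
    return term.lower() in text.lower()


def whole_word_match(term, text):
    """Stricter alternative: match the term only on word boundaries."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None
```

Note that the whole-word version also stops matching "played", which is
part of why the looser behaviour is arguably a feature.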

With some experience in different languages, it may eventually make
sense to add a "source-code" note along the lines of the following,
which would be displayed when editing the terminology entry in any
language, even where it has not yet been translated:

#. {play a game}; {play a recording}

(Doing this would not be possible in Pootle itself, but could be
incorporated into an improved version of the poterminology script.)
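
As a sketch of how such a note might be derived from a human-edited
translation, the {context} markers can simply be pulled out of the msgstr
and joined.  The helper name source_note is hypothetical, and it assumes
reviewers keep to the "translation {context}" convention shown above.

```python
import re


def source_note(msgstr):
    """Build a '#.' comment from the {context} markers in msgstr.

    Returns None when the translation carries no context markers.
    """
    contexts = re.findall(r"\{[^{}]*\}", msgstr)
    if not contexts:
        return None
    return "#. " + "; ".join(contexts)
```

Applied to the edited "play" translation above, this gives back the
"#. {play a game}; {play a recording}" line.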

I'd be interested in feedback on this, especially results running it on
different language PO sets (or original POT files) - the generated
terminology should be the same, but there could be bugs.  Also, any
results that you get from tweaking the stoplist entries and/or the
poterminology threshold options would be of interest as well.  Once we
have some experience with this script (and I've had time to add a few
more features, like generating the "source-code" notes from a
human-edited glossary, and an --update option that updates the existing
glossary, using it as both input and output file), I plan to submit it
upstream to the translate-toolkit project for inclusion in the next release.

@alex

-- 
mailto:alex.dupuy at mac.com


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: poterminology
Url: http://lists.laptop.org/pipermail/localization/attachments/20080318/3c0ad7ec/attachment-0002.txt 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: poterminology.py
Type: text/x-python
Size: 17781 bytes
Desc: not available
Url : http://lists.laptop.org/pipermail/localization/attachments/20080318/3c0ad7ec/attachment-0001.py 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: stoplist-en
Url: http://lists.laptop.org/pipermail/localization/attachments/20080318/3c0ad7ec/attachment-0003.txt 

