[Localization] New glossary options for eToys

Sat Jun 4 01:49:25 EDT 2011

Hi Chris (and other Sugar/OLPC/Etoys localizers),

First, as the author of poterminology, I'd like to say how happy I am to
see it being used, and ask that you not hesitate to mention any missing
features or bugs you might notice.  Although I have not been actively
maintaining poterminology (Alaa from translate.org.za has made all of
the recent changes), I still lurk around this list and the
Pootle/Translate Toolkit lists, and might be able to provide some useful
insight on any feature requests/bug reports, if not actually find time
to fix them.

Second, looking at the three proposed terminology files, the glossary3
version adds a lot of entries that really are more like phrases, and
which don't really add anything that useful.  If I had to choose from
the three proposals as-is, I would probably pick the glossary4 version,
although the glossary5 version is tempting as I think 250 terms is
probably closer to the sweet spot in terms of glossary utility than 400+
terms.  However, I think that with some changes to the poterminology
commands you are using, it is possible to improve the output and come up
with one or more terminology glossaries better than any of the current
three proposals.

To that end, I would first direct you to the Examples section of the
poterminology wiki page
(http://translate.sourceforge.net/wiki/toolkit/poterminology#examples)
which is also present in the manual page for poterminology.  There are a
number of hints there that may help you when using poterminology (e.g.
use the "--sort dictionary" option when generating multiple variants as
it allows for easier diffing of the results of the different variants).

I noticed in your command examples that you are using --term-words=7;
the longest phrase you are generating (in glossary3) has only five words
(ignoring stopwords like "a" and "the" that are not counted toward the
limit): "file choose another name cancel", and frankly none of the terms
with more than three words is really that useful.  Using such a large
--term-words limit just increases the memory requirements of
poterminology (and slows it down) in the initial stages of processing.
Remembering that a term like "enter a new label" only counts as three
words due to stopword list processing, I think you can safely accept the
default of 3 rather than overriding it; if you see that you are losing
useful terms as a result, you could try --term-words=4 but I really see
no need to go any longer than that.  But this is really mostly a
performance issue, and not so much of a quality issue.
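
As a rough illustration of how stopwords affect the word count, here is
a minimal Python sketch.  The STOPWORDS set and both function names are
hypothetical and exist only for this example; the real stoplist is much
longer and poterminology's own handling is more involved than this.

```python
# Hypothetical, tiny stopword set for illustration only; the real list
# shipped with poterminology is much longer.
STOPWORDS = {"a", "an", "the"}

def counted_words(phrase, stopwords=STOPWORDS):
    """Return the words of `phrase` that count toward a word limit."""
    return [w for w in phrase.lower().split() if w not in stopwords]

def fits_limit(phrase, term_words=3):
    """True if the phrase has at most `term_words` non-stopword words."""
    return len(counted_words(phrase)) <= term_words

print(len(counted_words("enter a new label")))        # prints 3
print(fits_limit("enter a new label"))                # prints True
print(fits_limit("file choose another name cancel"))  # prints False
```

Under these assumptions the five-word phrase from glossary3 is the only
one of the two that needs a limit above the default of 3.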

Varying just the minimum number of input files containing potential
terminology is not using the threshold filtering functionality of
poterminology to its fullest, and especially with such a large project
as Etoys you really need to increase the default settings of several
threshold filters.  Looking at the glossary3 file, I can see that
setting --locs-needed=5 (the default is 2) would reduce the output from
738 terms to 490, and while this would lose some useful terms relative
to the approximately equal-sized glossary4 file, I think it would be an
improvement overall.  I would recommend setting --locs-needed to 4 or 5
in your trial runs.  (FYI, I was able to perform this analysis by
looking at the #: lines indicating location and grepping for
"more+locations", which only appears when more than twice as many
locations as the threshold are present.)
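
A similar estimate can be sketched in Python by counting the entries on
the #: lines of each term.  This is only an approximation under stated
assumptions: the sample data and file names are invented, multi-line
msgids are ignored, and a marker such as "more+locations" would be
counted as a single location, so truncated lists are undercounted.

```python
def location_counts(po_text):
    """Map each msgid to the number of entries on its '#:' lines."""
    counts = {}
    pending = 0
    for line in po_text.splitlines():
        line = line.rstrip()
        if line.startswith("#:"):
            # Each whitespace-separated token is treated as one location.
            pending += len(line[2:].split())
        elif line.startswith('msgid "'):
            msgid = line[len('msgid "'):-1]
            if msgid:  # skip the header entry, whose msgid is empty
                counts[msgid] = pending
            pending = 0
    return counts

def surviving_terms(po_text, locs_needed):
    """Terms listing at least `locs_needed` locations."""
    counts = location_counts(po_text)
    return sorted(t for t, n in counts.items() if n >= locs_needed)

# Invented sample; real poterminology output will look different.
SAMPLE = '''\
#: a.po:1 b.po:7 c.po:3
msgid "costume"
msgstr ""

#: a.po:2 b.po:9 c.po:4 d.po:1
#: e.po:5
msgid "script"
msgstr ""
'''

print(surviving_terms(SAMPLE, 5))  # prints ['script']
```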

You will also increase the impact of the glossary for a given size by
increasing the --substr-needed option from its default of 2; any value
of --substr-needed that is not greater than the --inputs-needed (or, in
most cases --locs-needed) thresholds is effectively a no-op.  Setting
this to values that are greater than both of those other thresholds will
eliminate terminology that only appears in a few translation items.  You
can run the following short Perl script on the poterminology output to
see the maximum --substr-needed value that would not eliminate each term:

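#!/usr/bin/perl -n
# Sum the trailing "(N)" counts from the "# (poterminology)" comment lines.
$count += $1 if (/^# \(poterminology\) .* \((\d+)\)/);
# At each msgid, print the accumulated count followed by the term.
print "$count msgid \"$1\"\n" if (/^msgid "([^"]+)"/);
# Reset the count once the entry's msgstr line is reached.
$count = 0 if (/^msgstr /);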

Running this small script on glossary3 shows that with
--inputs-needed=3, setting --substr-needed to the values from 3 to 11
would result in terminology glossaries of the following sizes:

--substr-needed=3   738  (a value of 3 or lower is a no-op with
                         --inputs-needed=3)
--substr-needed=4   645
--substr-needed=5   550
--substr-needed=6   489
--substr-needed=7   434
--substr-needed=8   385
--substr-needed=9   349
--substr-needed=10  321
--substr-needed=11  289
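
For anyone who would rather tabulate these sizes directly, the
following is a minimal Python sketch that reuses the same two regular
expressions as the Perl script.  The function names and the sample
data are invented for illustration (in particular the text between
"(poterminology)" and the parenthesized number is made up), and
"small enough" is taken here to mean no larger than the chosen size.

```python
import re

NOTE_RE = re.compile(r'^# \(poterminology\) .* \((\d+)\)')
MSGID_RE = re.compile(r'^msgid "([^"]+)"')

def term_counts(po_text):
    """Map each msgid to the summed counts from its comment lines."""
    counts = {}
    count = 0
    for line in po_text.splitlines():
        note = NOTE_RE.match(line)
        if note:
            count += int(note.group(1))
        msgid = MSGID_RE.match(line)
        if msgid:
            counts[msgid.group(1)] = count
        if line.startswith("msgstr "):
            count = 0  # the entry is finished; start over for the next
    return counts

def glossary_size(counts, substr_needed):
    """Number of terms whose count is at least `substr_needed`."""
    return sum(1 for n in counts.values() if n >= substr_needed)

def smallest_threshold_under(counts, max_size, start=2):
    """Lowest threshold whose glossary is no larger than `max_size`."""
    threshold = start
    while glossary_size(counts, threshold) > max_size:
        threshold += 1
    return threshold

# Invented sample data, not real poterminology output.
SAMPLE = '''\
# (poterminology) first.po (3)
# (poterminology) second.po (4)
msgid "costume"
msgstr ""

# (poterminology) first.po (2)
msgid "paint box"
msgstr ""
'''

counts = term_counts(SAMPLE)
print(counts)                               # prints {'costume': 7, 'paint box': 2}
print(glossary_size(counts, 3))             # prints 1
print(smallest_threshold_under(counts, 1))  # prints 3
```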

For setting the threshold tuning values when generating the Etoys
glossary, I would suggest deciding on a maximum size limit for the
glossary (my opinion would be about 300, but you should get feedback
from others), and then for each of the six combinations of
--inputs-needed 3..4 and --locs-needed 4..6, increase --substr-needed
until the resulting glossary is under the maximum size limit.  We can
all then use diff to compare the results and see what we think is the
most useful generated glossary -- I actually used the following bash
pipeline to do diffs:

diff -u <(grep msgid etoys_glossary4.po|sort) <(grep msgid etoys_glossary3.po|sort)
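
Where process substitution is not available, roughly the same
comparison can be sketched with Python's difflib.  The function names
and the sample strings are invented for this example, and Python sorts
by code point, so the ordering may differ from a locale-aware sort.

```python
import difflib

def msgid_lines(po_text):
    """Sorted lines containing 'msgid', like `grep msgid file | sort`."""
    return sorted(l for l in po_text.splitlines() if "msgid" in l)

def glossary_diff(old_text, new_text, old_name="old.po", new_name="new.po"):
    """Unified diff of the msgid lines of two glossaries, as a list."""
    return list(difflib.unified_diff(
        msgid_lines(old_text), msgid_lines(new_text),
        fromfile=old_name, tofile=new_name, lineterm=""))

# Invented sample content standing in for two glossary files.
OLD = 'msgid "costume"\nmsgstr ""\nmsgid "paint box"\nmsgstr ""\n'
NEW = 'msgid "costume"\nmsgstr ""\nmsgid "script"\nmsgstr ""\n'

for line in glossary_diff(OLD, NEW):
    print(line)
```

For real use, the two strings would be read from the glossary files
being compared, e.g. etoys_glossary4.po and etoys_glossary3.po.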

Finally, there are a few more advanced tricks that could be used to
automatically remove a few other less useful terminology entries.  One
thing that I noticed myself is that there are a dozen or so entries
that are duplicates of other entries with the mere addition of the
extra word "in" (e.g. "code in").  It may be possible to eliminate those
by using an alternate stopword file where the line containing "=in" is
replaced with ">in" instead.  (The default stoplist is in
share/stoplist-en and you can specify an alternate stopword file with
the -S/--stopword-list option, e.g. -S stoplist-en-etoys.)  If you can try
this out and let me know if it works, I can commit this change as an
enhancement to the default stopword file - if it doesn't work, I'll
create a bug report indicating a possible solution that would allow that
stopword file change to actually suppress those entries.
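
Producing the alternate stopword file is a mechanical edit; a minimal
Python sketch follows.  The function name and the sample input are
invented, and the sketch only swaps the one line, leaving every other
line (including longer words that merely start with "in") untouched.
Whether the change actually suppresses the duplicate entries still
needs to be tested as described above.

```python
def demote_in(stoplist_text):
    """Return the stoplist text with the '=in' line changed to '>in'."""
    out = []
    for line in stoplist_text.splitlines():
        if line.strip() == "=in":
            line = ">in"
        out.append(line)
    return "\n".join(out) + "\n"

# Invented three-line sample, not the contents of the real stoplist.
print(demote_in("=a\n=in\n=into\n"), end="")

# To write the alternate file (paths as mentioned in this message):
# with open("share/stoplist-en") as src:
#     text = demote_in(src.read())
# with open("stoplist-en-etoys", "w") as dst:
#     dst.write(text)
```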

There are probably a few other tricks with the stopword list that could
be used to improve the resulting terminology glossaries, but it is late
so I will stop here for now.  I hope that you find this helpful.

@alex

-- 
mailto:alex.dupuy at mac.com