[Localization] New glossary options for eToys

Chris Leonard cjlhomeaddress at gmail.com
Mon Jun 6 00:08:53 EDT 2011


Alex,

Thank you for writing poterminology and thank you for your thoughtful
response and analysis.  It will take me a little time to digest it,
but I will make further adjustments to the parameters as you suggest
and recalculate some candidate glossaries for eToys.

As for the potential performance cost of using --term-words=7, I can
only say that I intentionally used a longer upper limit so as not to exclude
the possibility of finding some longer repeated phrases at the higher
end of the distribution curve.  On my modest home system there was no
significant processing delay.  On jobs like this that are not run
frequently (perhaps once a year?), the difference between completing
in 3 seconds and 30 seconds is more-or-less meaningless.  I may have a
distorted perspective on this given my training as a computational
biologist. Back in the day, I would routinely spin up a 20-processor
SGI Origin 2000 for week-long runs to trace gene families back to the
Last Common Ancestor.

http://www.ncbi.nlm.nih.gov/pubmed?term=9799791

There are a great many similarities between the parameter adjustments
you discuss and those used in conducting sequence analysis, and
although it's been a while since I've had to think deeply about such
concepts, it will be a pleasure to dust off those neural circuits
once again :-)

Warmest Regards,

cjl

On Sat, Jun 4, 2011 at 1:49 AM, Alexander Dupuy <alex.dupuy at mac.com> wrote:
> Hi Chris (and other Sugar/OLPC/Etoys localizers),
>
>
> First, as the author of poterminology, I'd like to say how happy I am to
> see it being used, and ask that you not hesitate to mention any missing
> features or bugs you might notice.  Although I have not been actively
> maintaining poterminology (Alaa from translate.org.za has made all of
> the recent changes) I still lurk around this list and the
> Pootle/Translate Toolkit lists, and might be able to provide some useful
> insight on any feature requests/bug reports, if not actually find time
> to fix them.
>
>
> Second, looking at the three proposed terminology files, the glossary3
> version adds a lot of entries that are really more like phrases and
> don't add anything particularly useful.  If I had to choose from
> the three proposals as-is, I would probably pick the glossary4 version,
> although the glossary5 version is tempting as I think 250 terms is
> probably closer to the sweet spot in terms of glossary utility than 400+
> terms.  However, I think that with some changes to the poterminology
> commands you are using, it is possible to improve the output and come up
> with one or more terminology glossaries better than any of the current
> three proposals.
>
>
> In this I would first direct you to the Examples section of the
> poterminology wiki page
> (http://translate.sourceforge.net/wiki/toolkit/poterminology#examples)
> which is also present in the manual page for poterminology.  There are a
> number of hints there that may help you when using poterminology (e.g.
> use the "--sort dictionary" option when generating multiple variants as
> it allows for easier diffing of the results of the different variants).
>
>
> I noticed in your command examples that you are using --term-words=7;
> the longest phrase you are generating (in glossary3) has only five words
> (ignoring stopwords like "a" and "the" that are not counted toward the
> limit): "file choose another name cancel" and frankly none of the terms
> with more than three words is really that useful - using such a large
> --term-words limit just increases the memory requirements of
> poterminology (and slows it down) in the initial stages of processing.
> Remembering that a term like "enter a new label" only counts as three
> words due to stopword list processing, I think you can safely accept the
> default of 3 rather than overriding it; if you see that you are losing
> useful terms as a result, you could try --term-words=4 but I really see
> no need to go any longer than that.  But this is really mostly a
> performance issue, and not so much of a quality issue.
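>
> (As a concrete point of reference, a trial run along those lines might
> look something like the command below.  This is only a sketch: the
> --output name and the etoys-po/ input directory are placeholders, so
> check poterminology --help for the exact input/output arguments your
> version expects.
>
> poterminology --sort dictionary --term-words=3 \
>     --output=etoys_glossary_trial.po etoys-po/
>
> The threshold options discussed below would be added to the same command.)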
>
>
> Varying just the minimum number of input files containing potential
> terminology is not using the threshold filtering functionality of
> poterminology to its fullest, and especially with such a large project
> as Etoys you really need to increase the default settings of several
> threshold filters.  Looking at the glossary3 file, I can see that
> setting --locs-needed=5 (the default is 2) would reduce the output from
> 738 terms to 490, and while this would lose some useful terms relative
> to the approximately equal-sized glossary4 file, I think it would be an
> improvement overall.  I would recommend setting --locs-needed to 4 or 5
> in your trial runs.  (FYI, I was able to perform this analysis by
> looking at the #: lines indicating location and grepping for
> "more+locations", which only appears when more than twice as many
> locations as the threshold are present.)
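>
> (If you want to reproduce that count, a rough equivalent -- assuming the
> file was generated with the default --locs-needed=2 -- is simply:
>
> grep -c 'more+locations' etoys_glossary3.po
>
> since that marker only shows up on entries with more than four locations,
> i.e. the ones that would survive --locs-needed=5.)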
>
>
> You will also increase the impact of the glossary for a given size by
> increasing the --substr-needed option from its default of 2; any value
> of --substr-needed that is not greater than the --inputs-needed (or, in
> most cases --locs-needed) thresholds is effectively a no-op.  Setting
> this to values that are greater than both of those other thresholds will
> eliminate terminology that only appears in a few translation items.  You
> can run the following short Perl script on the poterminology output to
> see the maximum --substr-needed value that would not eliminate each term:
>
>
> #!/usr/bin/perl -n
> # Sum the occurrence counts from the "(poterminology)" comment lines,
> # print the running total in front of each msgid, then reset after each msgstr.
> $count += $1 if (/^# \(poterminology\) .* \((\d+)\)/);
> print "$count msgid \"$1\"\n" if (/^msgid "([^"]+)"/);
> $count = 0 if (/^msgstr /);
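>
> (Assuming you save the script as, say, substr-counts.pl -- the filename is
> just an example -- you can run it and scan from the largest counts down with:
>
> perl substr-counts.pl etoys_glossary3.po | sort -rn | less
>
> where the number in front of each msgid is the largest --substr-needed
> value that would still keep that term.)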
>
>
> Running this small script on glossary3 shows that with
> --inputs-needed=3, setting --substr-needed to the values from 3 to 11
> would result in terminology glossaries of the following sizes:
>
>
> --substr-needed=3   738  (a value of 3 or lower is a no-op with --inputs-needed=3)
> --substr-needed=4   645
> --substr-needed=5   550
> --substr-needed=6   489
> --substr-needed=7   434
> --substr-needed=8   385
> --substr-needed=9   349
> --substr-needed=10  321
> --substr-needed=11  289
>
>
> For setting the threshold tuning values when generating the Etoys
> glossary, I would suggest deciding on a maximum size limit for the
> glossary (my opinion would be about 300, but you should get feedback
> from others), and then for each of the six combinations of
> --inputs-needed 3..4 and --locs-needed 4..6, increase --substr-needed
> until the resulting glossary is under the maximum size limit.  We can
> all then use diff to compare the results and see what we think is the
> most useful generated glossary -- I actually used the following bash
> pipeline to do diffs:
>
> diff -u <(grep msgid etoys_glossary4.po | sort) \
>         <(grep msgid etoys_glossary3.po | sort)
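>
> (If it helps, that whole sweep could be scripted roughly as follows.  This
> is only a sketch: the etoys-po/ input directory, the output file names,
> and the --output option are placeholders, and 300 is the size cap
> suggested above.
>
> for i in 3 4; do
>   for l in 4 5 6; do
>     for s in $(seq $((l + 1)) 15); do
>       out="etoys-i${i}-l${l}-s${s}.po"
>       poterminology --sort dictionary --inputs-needed=$i --locs-needed=$l \
>           --substr-needed=$s --output="$out" etoys-po/
>       # count non-empty msgids; stop raising --substr-needed once under the cap
>       [ "$(grep -c '^msgid "[^"]' "$out")" -le 300 ] && break
>     done
>   done
> done
>
> The last file generated for each combination is then a candidate to diff
> as shown above.)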
>
>
> Finally, there are a few more advanced tricks that could be used to
> automatically remove a few other less useful terminology entries.  One
> thing that I noticed myself is that there are a dozen or so entries
> that are duplicates of other entries with the mere addition of the
> extra word "in" (e.g. "code in"); it may be possible to eliminate those
> by using an alternate stopword file in which the line containing "=in" is
> replaced with ">in" instead.  (The default stoplist is in
> share/stoplist-en and you can specify an alternate stopword file with
> the -S/--stopword-list option, e.g. -S stoplist-en-etoys.)  If you can try
> this out and let me know if it works, I can commit this change as an
> enhancement to the default stopword file - if it doesn't work, I'll
> create a bug report indicating a possible solution that would allow that
> stopword file change to actually suppress those entries.
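>
> (A minimal way to try that -- assuming share/stoplist-en is readable at
> that relative path, otherwise adjust it to wherever the Translate Toolkit
> installed its data files -- would be:
>
> sed 's/^=in$/>in/' share/stoplist-en > stoplist-en-etoys
>
> followed by passing -S stoplist-en-etoys on the next poterminology run.)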
>
>
> There are probably a few other tricks with the stopword list that could
> be used to improve the resulting terminology glossaries, but it is late
> so I will stop here for now.  I hope that you find this helpful.
>
>
> @alex
>
> --
> mailto:alex.dupuy at mac.com
>
> _______________________________________________
> Localization mailing list
> Localization at lists.laptop.org
> http://lists.laptop.org/listinfo/localization
>

