[Localization] [Translate-devel] Revamping the glossary - poterminology

Dwayne Bailey dwayne at translate.org.za
Tue Mar 25 06:32:52 EDT 2008


Hi,

This is great.  I spent a bit of time testing it on some of our
translations and it worked really well.   Now I can finally throw away
poglossary, yay, may she RIP!  Many thanks and, as Friedel said, thanks for
building this on top of the Translate Toolkit.

On Tue, 2008-03-18 at 16:22 -0400, Alexander Dupuy wrote:
> I previously sent this last week to localization at lists.laptop.org but it was 
> bounced as too large, due to the sample glossary.  I'm re-sending it in two 
> parts (and CC'ing the translate-toolkit developers list on the first part as well).
> 
> As background for those on the translate-devel list, the OLPC (One Laptop Per
> Child) project is using Pootle to localize applications (activities) for the XO
> children's laptop; currently we are using a manually created terminology
> glossary for Pootle, but it is very small and we wanted to create a larger 
> terminology file.  I pointed Sayamindu at the translate.sourceforge.net wiki 
> page on creating a glossary, and actually used the poglossary program on our 
> source files, but found both approaches to be pretty lacking.
> 
> ...
> 
> Sayamindu Dasgupta wrote:
> >> Thanks for the pointer to the translate.sourceforge.net page. I think
> >> that the translate project folks have already managed to achieve what
> >> I was trying to do.
> >> I have followed the steps outlined in that page - and I'm attaching
> >> the output. Does the attached file look more comprehensive ?
> 
> I replied:
> > It's certainly more comprehensive, but it goes a bit far the other 
> > way, including a lot of terms that only appear once, and stuff that 
> > not only doesn't need to be in the Pootle Terminology, but that 
> > probably shouldn't even be translated.
> > Did you use the poglossary  bash script in translate-toolkit-1.1.0?  
> > It seems to automate the steps in their documentation, so that you can 
> > run it as a single script rather than a whole sequence of commands.
> >
> > However, this would appear to suffer from most of the same problems as 
> > noted above, and a few others.  Running it on the Spanish PO files 
> > downloaded from Pootle as packaging-es.zip pootle-es.zip 
> > update1-es.zip xo_bundled-es.zip gives 787 msgids in 3621 lines (your 
> > version has 605 msgids in 2467 lines).  So it's not really a panacea, 
> > but does provide a convenient starting point for simple modifications.
> >
> > The word counting approach I described earlier is more complex, but 
> > should provide better results (although it will require significantly 
> > more coding).
> 
> Not quite two weeks later, I have a pretty mature version of the word
> counting program I was thinking of, although it's written in Python
> rather than Bash, and I used poconflicts.py as a starting point rather
> than pocount.py or poglossary.  Running it over the 24 Spanish PO files
> downloaded from Pootle as act_server-es.zip packaging-es.zip
> pootle-es.zip terminology-es.zip update1-es.zip xo_bundled-es.zip (note
> that this includes the original glossary.v.0.1.po) gives the following
> results:
> 
> $ poterminology -F -S stoplist-en *.po new-glossary.po
> processing 24 files...
> [###########################################] 100%
> 3279 terms from 2953 units in 24 files
> 355 terms after thresholding
> 352 terms after subphrase reduction
> 
> $ time poterminology -F -S stoplist-en *.po ../etoys/etoys.po glossary-with-etoys.po
> processing 25 files...
> [###########################################] 100%
> 13295 terms from 6614 units in 25 files
> 628 terms after thresholding
> 625 terms after subphrase reduction
> 
> real    0m13.245s
> user    0m13.144s
> sys     0m0.092s
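
For readers unfamiliar with the three stages reported in the output
above (term extraction, thresholding, subphrase reduction), here is a
minimal Python sketch of the general idea.  It is not the poterminology
code: the function names, the n-gram limit, the "at least 2 units"
threshold and the rule for dropping subphrases are all assumptions made
for illustration, and stoplist handling is left out entirely.

```python
import re
from collections import Counter

# Hypothetical helpers for illustration only; not part of translate-toolkit.

def extract_terms(units, max_words=3):
    """Count each word n-gram (1..max_words) once per unit it appears in."""
    counts = Counter()
    for text in units:
        words = re.findall(r"[a-z']+", text.lower())
        seen = set()
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                seen.add(" ".join(words[i:i + n]))
        counts.update(seen)
    return counts

def threshold(counts, min_units=2):
    """Keep only terms that occur in at least min_units units (assumed rule)."""
    return {term: n for term, n in counts.items() if n >= min_units}

def reduce_subphrases(terms):
    """Drop a term when a longer kept term contains it with the same count."""
    kept = dict(terms)
    for longer, n in terms.items():
        words = longer.split()
        for size in range(1, len(words)):
            for i in range(len(words) - size + 1):
                sub = " ".join(words[i:i + size])
                if kept.get(sub) == n:
                    del kept[sub]
    return kept

units = ["Browse activity", "Start the Browse activity", "Stop activity"]
counts = extract_terms(units)
frequent = threshold(counts)
glossary = reduce_subphrases(frequent)

# Most frequent terms first.
for term in sorted(glossary, key=lambda t: -glossary[t]):
    print(glossary[term], term)
```

On these three sample strings the sketch keeps "activity" (3 units) and
"browse activity" (2 units); "browse" alone is dropped because it never
occurs outside "browse activity".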
> 
> I've attached the generated Spanish glossary (without the etoys.po)
> [NOTE - now attached to separate message to avoid length limits]
> along with the poterminology and poterminology.py scripts (they can be
> installed in /usr/bin/ and
> /usr/lib/python2.*/site-packages/translate/tools/ respectively, and are
> intended for eventual incorporation into the translate-toolkit package;
> I'm working with version 1.1.0).  Also attached is stoplist-en, which is a list of
> stop-words (and regex patterns) for English that do a pretty good job of
> eliminating the junk like numbers etcetera.  The format of this file is
> not yet described in the source code or manual page, but a comment at
> the beginning of the stoplist-en file describes the format well enough
> for someone to tinker with it.
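
The format described in the header comment of the attached stoplist-en
(see the attachment further down in this message) can be read with a few
lines of Python.  This is a sketch under assumptions, not the parser in
poterminology.py: the names parse_stoplist and classify are made up, and
it assumes that a regex pattern must match the whole word and that a
word not listed anywhere is allowed both alone and inside phrases.

```python
import re

# Marker -> (word allowed on its own, effect inside a phrase).
# "disregard" means the word does not count against --term-word-length.
ACTIONS = {
    "+": (True, "allow"),
    ":": (True, "disregard"),
    "<": (True, "reject"),
    "=": (False, "allow"),
    ">": (False, "disregard"),
    "@": (False, "reject"),
}

def parse_stoplist(lines):
    """Return (exact-word table, compiled regex patterns)."""
    words, patterns = {}, []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue                      # blank line or comment
        if line.startswith("/"):
            patterns.append(re.compile(line[1:]))
        elif line[0] in ACTIONS:
            words[line[1:].lower()] = ACTIONS[line[0]]
        else:
            raise ValueError("unrecognized stoplist line: %r" % line)
    return words, patterns

def classify(word, words, patterns):
    """Look up one word; exact matches win and skip the regex patterns."""
    word = word.lower()
    if word in words:
        return words[word]
    # Assumption: a pattern has to match the entire word.
    if any(p.fullmatch(word) for p in patterns):
        return (False, "reject")
    return (True, "allow")

sample = """\
# excerpt from stoplist-en
+no
/..?
/[0-9,.]+(st|nd|rd)?
:on
<first
=about
>the
@which
""".splitlines()

words, patterns = parse_stoplist(sample)
for w in ["no", "ok", "3rd", "the", "glossary"]:
    print(w, classify(w, words, patterns))
```

The "+no" line shows why '+' exists: without it, the two-letter pattern
"/..?" would discard "no" and every phrase containing it.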
> 
> While the results aren't perfect, they are pretty reasonable, and have a
> couple of advantages over both the current glossary and the results of
> poglossary or the equivalent manual steps:
> 
> * most (although not all) of the existing terminology glossary is
> preserved; the entries dropped are obsolete (e.g. "Web activity" - this
> is now "Browse activity") or only appear in one place (or not at all) in
> the actual system POT/PO files
> 
> * terminology is extracted not only from short messages, but also from
> much longer ones, like those in the activation server and the "stuffer
> sheet"
> 
> * default sorting by frequency puts the most commonly used terms first,
> which is helpful for incomplete translations of the terminology glossary
> (people seem to start at the beginning and work forward) [other sorting
> options are available in poterminology: dictionary/alphabetic and by
> string length]
> 
> * only uses strings that are actually present in the source PO/POTs -
> the current glossary has entries like "Paint (activity)" which Pootle
> will not display for a string containing "paint" (without the
> "(activity)") - a better way to mark multiple meanings for an English
> term is in the translation (and, eventually, in the source code notes) -
> a good example of this in the new-glossary.po is the entry for "play"
> reproduced here in full:
> 
> # (poterminology) connect-activity.po (1)
> # (poterminology) TamTamSynthLab.activity.po (3)
> # (poterminology) record-activity.po (1)
> # (poterminology) TamTamMini.activity.po (2)
> # (poterminology) memorize.po (1)
> # (poterminology) TamTamJam.activity.po (2)
> # (poterminology) TamTamEdit.activity.po (3)
> # (poterminology) slider-puzzle.po (1)
> #: constants.py
> #: activity.py
> #: Edit/EditToolbars.py
> #: Mini/miniTamTamMain.py
> #: mmm_modules/buddy_panel.py
> #: SynthLab/SynthLabConstants.py
> #: common/Resources/tooltips_en.py
> #: /home/msgodoi/olpc/workspace/Memorize.activity/activity.py
> #, fuzzy
> msgid "play"
> msgstr ""
> "jugar {memorize.po}; reproducir {record-activity.po, TamTamEdit.activity.po, "
> "TamTamJam.activity.po, TamTamMini.activity.po, TamTamSynthLab.activity.po}"
> 
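
The msgstr in the "play" entry above groups each distinct translation
with the files it came from.  A rough Python sketch of how such a string
could be assembled is shown here; merge_translations is a hypothetical
helper, and the sorting and the "fuzzy only when translations differ"
rule are assumptions inferred from that one example, not taken from
poterminology.py.

```python
def merge_translations(by_file):
    """Combine per-file translations of one source term.

    by_file maps an input filename to the translation found in it.
    Returns (msgstr text, fuzzy flag).
    """
    sources = {}
    for filename, translation in by_file.items():
        sources.setdefault(translation, []).append(filename)
    if len(sources) == 1:
        # Every file agrees, so no annotation is needed.
        return next(iter(sources)), False
    parts = [
        "%s {%s}" % (text, ", ".join(sorted(files, key=str.lower)))
        for text, files in sorted(sources.items())
    ]
    return "; ".join(parts), True

play = {
    "memorize.po": "jugar",
    "record-activity.po": "reproducir",
    "TamTamEdit.activity.po": "reproducir",
    "TamTamJam.activity.po": "reproducir",
    "TamTamMini.activity.po": "reproducir",
    "TamTamSynthLab.activity.po": "reproducir",
}
text, fuzzy = merge_translations(play)
print(text)
print("fuzzy" if fuzzy else "not fuzzy")
```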
> The default output of poterminology marks conflicting translations with
> the input .po file - in this case the verb "play" has different
> translations for playing a game or playing a recording or musical
> piece.  The entry is marked as fuzzy, and a reviewer could edit the
> translation to something like:
> 
> msgstr "jugar {play a game}; reproducir {play a recording}"
> 
> With a terminology translation like this, Pootle will display both valid
> translations for any string containing the English word "play" (or
> "player" or "played" or "foreplay" - but that's a Pootle bug/feature)
> and the translator should be able to figure out which of the different
> choices is most appropriate.
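
The "player"/"foreplay" behaviour mentioned above is what plain
substring matching produces.  The following Python sketch contrasts it
with whole-word matching; it only illustrates the difference and makes
no claim about how Pootle actually implements terminology lookup (both
function names are invented here).

```python
import re

def substring_match(term, text):
    """True if term occurs anywhere in text, even inside another word."""
    return term.lower() in text.lower()

def whole_word_match(term, text):
    """True only if term occurs as a separate word in text."""
    pattern = r"\b%s\b" % re.escape(term)
    return re.search(pattern, text, re.IGNORECASE) is not None

for text in ["Play the game", "Media player", "Recently played"]:
    print(text, substring_match("play", text), whole_word_match("play", text))
```

Whole-word matching would avoid the false hits, but it would also miss
inflected forms such as "played" that a translator may still want the
glossary entry for, which is presumably why this is as much a feature as
a bug.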
> 
> With some experience in different languages, it may eventually make
> sense to add a "source-code" note along the lines of the following,
> which would be displayed when editing the terminology entry in any
> language, even where it has not yet been translated:
> 
> #. {play a game}; {play a recording}
> 
> (Doing this would not be possible in Pootle itself, but could be
> incorporated into an improved version of the poterminology script.)
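
As a rough idea of what generating such a note from a human-edited
glossary might look like, here is a small Python sketch.  It assumes the
reviewer has put each meaning in braces in the msgstr, as in the edited
"play" example; source_note is a hypothetical name, and nothing like it
exists in the attached script.

```python
import re

def source_note(msgstr):
    """Build a '#. ' comment from the {...} annotations in a reviewed msgstr.

    Returns None when the translation carries no annotations.
    """
    notes = re.findall(r"\{([^{}]*)\}", msgstr)
    if not notes:
        return None
    return "#. " + "; ".join("{%s}" % note.strip() for note in notes)

note = source_note("jugar {play a game}; reproducir {play a recording}")
print(note)
```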
> 
> I'd be interested in feedback on this, especially results running it on
> different language PO sets (or original POT files) - the generated
> terminology should be the same, but there could be bugs.  Also, any
> results that you get from tweaking the stoplist entries and/or the
> poterminology threshold options would be of interest as well.  Once we
> have some experience with this script (and I've had time to add a few
> more features, like generating the "source-code" notes from a
> human-edited glossary, and an --update option that updates the existing
> glossary, using it as both input and output file) I plan to submit this
> to the upstream translate-toolkit site for inclusion in the next release.
> 
> @alex
> 
> plain text document attachment (poterminology)
> #!/usr/bin/python
> # 
> # Copyright 2008 Zuza Software Foundation
> # 
> # This file is part of the translate-toolkit
> #
> # translate is free software; you can redistribute it and/or modify
> # it under the terms of the GNU General Public License as published by
> # the Free Software Foundation; either version 2 of the License, or
> # (at your option) any later version.
> # 
> # translate is distributed in the hope that it will be useful,
> # but WITHOUT ANY WARRANTY; without even the implied warranty of
> # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> # GNU General Public License for more details.
> #
> # You should have received a copy of the GNU General Public License
> # along with translate; if not, write to the Free Software
> # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
> 
> """reads a set of .po or .pot files to produce a pootle-terminology.pot"""
> 
> from translate.tools import poterminology
> 
> if __name__ == '__main__':
>   poterminology.main()
> 
> 
> plain text document attachment (stoplist-en)
> #
> # Stoplist of common English words to exclude from terminology
> #
> # Lines beginning with a '#' are comments
> #
> # Lines beginning with a '/' are regex patterns - any word that matches will
> # be ignored by itself, and any phrase containing it will be ignored as well
> # (note, regex patterns will NOT be checked if an exact word match is found)
> #
> # All other lines should begin with one of the following characters, which
> # indicate whether the word should be ignored (as a word alone), disregarded
> # in a phrase (i.e. phrase containing it is allowed, and word does not count
> # against --term-word-length limit), or any phrase containing it should be
> # ignored. (Note: use of '+' is only needed for exceptions to regex patterns.)
> #
> # '+' allow word alone, allow phrases containing it (regardless of regex match)
> # ':' allow word alone, disregarded (for --term-word-length) inside phrase
> # '<' allow word alone, but ignore any phrase containing it
> # '=' ignore word alone, but allow phrases containing it
> # '>' ignore word alone, disregarded (for --term-word-length) inside phrase
> # '@' ignore word alone, and ignore any phrase containing it
> #
> +no
> /.*\(.*
> /..?
> /[0-9,.]+(st|nd|rd)?
> :off
> :on
> <first
> <hello
> <last
> <next
> <ok
> <second
> <welcome
> <yes
> =able
> =about
> =above
> =according
> =accordingly
> =across
> =actually
> =after
> =afterwards
> =again
> =against
> =all
> =allow
> =allowing
> =allows
> =almost
> =alone
> =along
> =already
> =also
> =although
> =always
> =am
> =among
> =amongst
> =another
> =any
> =anybody
> =anyhow
> =anyone
> =anything
> =anyway
> =anyways
> =anywhere
> =apart
> =appear
> =approach
> =appropriate
> =area
> =areas
> =around
> =as
> =aside
> =ask
> =asking
> =aspects
> =associated
> =at
> =available
> =away
> =awfully
> =backed
> =backing
> =backs
> =based
> =became
> =because
> =become
> =becomes
> =becoming
> =before
> =beforehand
> =began
> =behind
> =being
> =beings
> =believe
> =below
> =beside
> =besides
> =best
> =better
> =between
> =beyond
> =big
> =both
> =brief
> =briefly
> =called
> =came
> =case
> =cases
> =cause
> =causes
> =certain
> =certainly
> =changes
> =clear
> =clearly
> =co
> =com
> =come
> =comes
> =concerned
> =concerning
> =consequently
> =consider
> =considered
> =considering
> =contain
> =containing
> =contains
> =corresponding
> =course
> =current
> =currently
> =definitely
> =degree
> =degrees
> =described
> =describes
> =despite
> =did
> =differ
> =different
> =differently
> =difficult
> =directly
> =discussed
> =discusses
> =doing
> =done
> =down
> =downed
> =downing
> =downs
> =downwards
> =during
> =each
> =early
> =easily
> =easy
> =edu
> =effects
> =eg
> =either
> =else
> =elsewhere
> =end
> =ended
> =ending
> =ends
> =enough
> =entirely
> =especially
> =etc
> =even
> =evenly
> =ever
> =everybody
> =everyone
> =everything
> =everywhere
> =ex
> =exactly
> =examined
> =examines
> =example
> =excellent
> =face
> =faces
> =fact
> =facts
> =far
> =felt
> =few
> =find
> =finds
> =followed
> =following
> =follows
> =former
> =formerly
> =forth
> =full
> =fully
> =further
> =furthered
> =furthering
> =furthermore
> =furthers
> =gave
> =general
> =generally
> =get
> =gets
> =getting
> =give
> =given
> =gives
> =go
> =goes
> =going
> =gone
> =good
> =goods
> =got
> =gotten
> =great
> =greater
> =greatest
> =group
> =grouping
> =groups
> =happens
> =hardly
> =having
> =hence
> =her
> =hereafter
> =hi
> =high
> =highest
> =him
> =his
> =hopefully
> =how
> =however
> =ignored
> =immediate
> =importance
> =important
> =in
> =inasmuch
> =inc
> =include
> =included
> =includes
> =including
> =increased
> =increasing
> =indeed
> =indicate
> =indicated
> =indicates
> =inner
> =insofar
> =instead
> =interest
> =interested
> =interesting
> =interests
> =into
> =inward
> =itself
> =just
> =keep
> =keeps
> =kept
> =kind
> =kinds
> =knew
> =know
> =known
> =knows
> =lack
> =large
> =largely
> =lately
> =later
> =latest
> =latterly
> =less
> =lest
> =let
> =let's
> =like
> =liked
> =likely
> =little
> =long
> =longer
> =longest
> =look
> =looking
> =looks
> =low
> =ltd
> =made
> =mainly
> =maintains
> =major
> =make
> =makes
> =making
> =man
> =many
> =maybe
> =me
> =mean
> =means
> =member
> =members
> =men
> =mm
> =mr
> =mrs
> =ms
> =my
> =myself
> =name
> =near
> =nearly
> =necessary
> =need
> =needed
> =needing
> =needs
> =nevertheless
> =new
> =newer
> =newest
> =nobody
> =non
> =none
> =noone
> =normally
> =not
> =nothing
> =novel
> =now
> =nowhere
> =number
> =numbers
> =oh
> =okay
> =old
> =older
> =oldest
> =ones
> =onto
> =open
> =opened
> =opening
> =opens
> =order
> =ordered
> =ordering
> =orders
> =org
> =otherwise
> =ought
> =our
> =ourselves
> =out
> =outlined
> =outlines
> =outside
> =over
> =overall
> =own
> =part
> =parted
> =particular
> =particularly
> =parting
> =parts
> =per
> =place
> =placed
> =places
> =please
> =plus
> =point
> =pointed
> =pointing
> =points
> =possible
> =present
> =presented
> =presenting
> =presents
> =presumably
> =probably
> =problem
> =problems
> =provide
> =provided
> =provides
> =providing
> =put
> =puts
> =quickly
> =quite
> =qv
> =really
> =reasonably
> =regarding
> =regardless
> =regards
> =relatively
> =require
> =required
> =requires
> =respectively
> =result
> =results
> =right
> =room
> =rooms
> =s
> =said
> =saw
> =say
> =saying
> =says
> =see
> =seeing
> =seen
> =sees
> =self
> =selves
> =sensible
> =sent
> =serious
> =seriously
> =set
> =shall
> =show
> =showed
> =showing
> =shows
> =side
> =sides
> =similar
> =simply
> =simulation
> =small
> =smaller
> =smallest
> =soon
> =sorry
> =specified
> =specify
> =specifying
> =state
> =states
> =sub
> =suggest
> =suggested
> =suggestions
> =suggests
> =sure
> =take
> =taken
> =tell
> =tends
> =thank
> =thanks
> =theirs
> =thence
> =thereupon
> =thing
> =things
> =think
> =thinks
> =thorough
> =thoroughly
> =thought
> =thoughts
> =took
> =toward
> =towards
> =tried
> =tries
> =truly
> =try
> =trying
> =turn
> =turned
> =turning
> =turns
> =type
> =types
> =un
> =under
> =undertaken
> =unfortunately
> =unless
> =unlikely
> =until
> =unto
> =up
> =upon
> =us
> =use
> =used
> =useful
> =uses
> =using
> =value
> =various
> =via
> =viz
> =vs
> =w
> =want
> =wanted
> =wanting
> =wants
> =way
> =ways
> =well
> =wells
> =went
> =whatever
> =when
> =whence
> =whenever
> =where
> =whereafter
> =whereas
> =whereby
> =wherein
> =whereupon
> =wherever
> =whither
> =whoever
> =whole
> =whom
> =whose
> =why
> =wide
> =willing
> =wish
> =with
> =within
> =without
> =wonder
> =work
> =worked
> =working
> =works
> =would
> =year
> =years
> =yet
> =you
> =young
> =younger
> =youngest
> =your
> =yourself
> =yourselves
> >a
> >an
> >and
> >by
> >de
> >do
> >does
> >en
> >for
> >from
> >la
> >of
> >the
> >to
> >und
> @ain't
> @aint
> @al
> @are
> @aren't
> @be
> @been
> @but
> @can
> @can't
> @cannot
> @cant
> @could
> @couldn't
> @couldnt
> @didn't
> @didnt
> @doesn't
> @doesnt
> @don
> @don't
> @dont
> @eight
> @eighteen
> @eighth
> @eighty
> @eleven
> @et
> @every
> @except
> @fifteen
> @fifth
> @fifty
> @five
> @four
> @fourteen
> @fourth
> @fourty
> @had
> @hadn't
> @has
> @hasn't
> @have
> @haven't
> @he
> @he's
> @here
> @here's
> @hers
> @herself
> @hes
> @himself
> @hundred
> @i'd
> @i'll
> @i'm
> @i've
> @ie
> @if
> @ii
> @iii
> @is
> @isn't
> @isnt
> @it
> @it'd
> @it'll
> @it's
> @its
> @iv
> @ix
> @latter
> @least
> @made
> @may
> @meanwhile
> @merely
> @might
> @more
> @moreover
> @most
> @mostly
> @much
> @must
> @namely
> @nd
> @neither
> @never
> @nine
> @nineteen
> @ninety
> @ninth
> @nor
> @obviously
> @often
> @once
> @one
> @only
> @or
> @other
> @others
> @ours
> @perhaps
> @rather
> @rd
> @re
> @same
> @secondly
> @seconds
> @seem
> @seemed
> @seeming
> @seems
> @seven
> @seventeen
> @seventh
> @seventy
> @several
> @she
> @should
> @shouldn't
> @shouldnt
> @since
> @six
> @sixteen
> @sixth
> @sixty
> @so
> @some
> @somebody
> @somehow
> @someone
> @something
> @sometime
> @sometimes
> @somewhat
> @somewhere
> @st
> @still
> @such
> @th
> @than
> @that
> @that's
> @thats
> @their
> @them
> @themselves
> @then
> @there
> @there's
> @thereafter
> @thereby
> @therefore
> @theres
> @theres
> @these
> @they
> @they'd
> @they'll
> @they're
> @they've
> @third
> @thirteen
> @thirty
> @this
> @those
> @though
> @three
> @through
> @throughout
> @thru
> @thus
> @together
> @too
> @twelve
> @twenty
> @twice
> @two
> @usually
> @very
> @vi
> @vii
> @viii
> @was
> @wasn't
> @wasnt
> @we
> @we'd
> @we'll
> @we're
> @we've
> @were
> @weren't
> @werent
> @what
> @what's
> @whats
> @where's
> @whether
> @which
> @while
> @who
> @who's
> @will
> @won't
> @wont
> @wouldn't
> @wouldnt
> @www
> @xi
> @xii
> @xiii
> @xiv
> @xix
> @xv
> @xvi
> @xvii
> @xviii
> @xx
> @xxi
> @you'd
> @you'll
> @you're
> @you've
> @yours
> @zero
> 
-- 
Dwayne Bailey
Translate.org.za

+27-12-460-1095 (w)
+27-83-443-7114 (cell)


