[Localization] Arabic Projects

Nicholas Bodley nbodley at speakeasy.net
Wed Jul 30 10:44:07 EDT 2008


On Wed Jul 30 7:17, "Walter Bender" sent:

>Welcome to the project.
>
>Localization is more than string translation. In particular, there is
>work to be done to improve the RTL rendering in Sugar. Do you know of
>any Python programmers who could help with this?
>
>regards.
>
>-walter

I'm really uncertain whether Sugar developers are aware of the complexity of
rendering Arabic text acceptably. I'd love to be able to say that rendering (that
is, displaying and printing) Arabic is easy; however, it is anything but easy.
Surely, our native Arabic speakers are completely aware of this, but it might
help if I explain something about what is involved, for the sake of those who
don't yet know.

First, a bit of personal background: I'm very interested in writing systems, but
am strictly a dilettante/amateur in the field. Please do correct anything I'll
say that is wrong!

Our Latin (or latin, or Roman/roman)* alphabet is quite straightforward to write
and typeset; it's simply a matter of placing letters in LtoR order on the writing
line. This holds true of several other alphabets, such as Greek and Cyrillic, but
those of India and Southeast Asia are not so simple.

*Homework I've neglected to do! :)

Hebrew is another RtoL writing system, but it's extremely simple when compared to
Arabic.

Keeping in mind that probably almost all writing systems have a "typeset" variety
as well as a cursive, flowing, handwritten variety, Arabic is rare in that when
properly written, its nature, even when "typeset" (and when rendered by computer)
is essentially cursive. Although trademarks and product labels can seem to be
"typeset", nevertheless, afaik, the only way to acceptably render Arabic text is
essentially according to cursive form, as if handwritten.

It is easy to position individual Arabic letters in RtoL sequence, leaving small
spaces between letters, but the result, I'm essentially sure, looks very bad; it
would definitely not be acceptable in an XO! (Apparently Arabic typewriters
created rather wretched-looking text; I'd love to know.)

An Arabic letter can have as many as four different forms, although (pretty
sure!) not every letter has all four. More, soon, about this.

Furthermore, keeping in mind the cursive (from the Latin for "running", iirc)
character of Arabic when properly rendered, consecutive letters are often joined.

As to the four forms, one is "standalone", or isolated -- this is the form a
letter takes when it's all by itself.

The other three forms are for the beginning and for the end of a word, as well as
a third form used within a word.

The names I recall for these forms are initial, medial, final, and isolated.

Arabic text in computer form could simply specify each letter in sequence, with
no regard to which of the four forms is to be used; however, to simplify the
process of rendering somewhat, precomposed "presentation forms" are offered in
Unicode (or were!) -- see Arabic Presentation Forms A and B,
Unicode ranges U+FB50--U+FDFF (A) and U+FE70--U+FEFF (B). (This was Unicode 3;
sorry if I mislead.)
Apparently, these simplify the process of rendering decent Arabic, although
(fairly sure) they are not a completely-acceptable solution.
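
To make the four forms concrete, here is a small Python sketch that asks the
Unicode character database which presentation forms exist for a given base
letter. It only reads the standard unicodedata module; the function name
presentation_forms is my own invention, and scanning only Presentation Forms-B
is an assumption that holds for the basic Arabic letters.

```python
import unicodedata

# Arabic Presentation Forms-B, U+FE70..U+FEFF
FORMS_B = range(0xFE70, 0xFF00)

def presentation_forms(letter):
    """Map form name ('initial', 'medial', ...) to the presentation character."""
    target = f"{ord(letter):04X}"
    forms = {}
    for cp in FORMS_B:
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep entries such as "<initial> 0628": one form tag, one base letter.
        if len(parts) == 2 and parts[1] == target:
            forms[parts[0].strip("<>")] = chr(cp)
    return forms

for base in ("\u0628", "\u0627"):  # BEH, ALEF
    print(unicodedata.name(base), sorted(presentation_forms(base)))

# Compatibility normalization folds a presentation form back to its base letter.
print(unicodedata.normalize("NFKC", "\uFE91") == "\u0628")
```

BEH comes back with all four forms, while ALEF has only isolated and final,
which matches the remark above that not every letter has four.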

=+=

In general, the term seen for describing the composition of Arabic text for
rendering from Unicode (or other coding) is "shaping and joining". As to shaping,
I haven't learned about that; perhaps even three of the four forms sometimes need
modifying before they join acceptably. Joining, of course, refers to combining
adjacent letters as if they were handwritten, with a continuous stroke. (Indeed,
there are more details, only some of which I'm aware of ... thinking of tatweels,
for instance.) 
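
As a rough illustration of "joining" only, the following Python sketch picks
initial, medial, final, or isolated for each letter of a word. It is a toy
under loud assumptions: it infers whether a letter can connect forward from
whether Presentation Forms-B lists an initial form for it, and it ignores
vowel marks, lam-alef ligatures, tatweels, and everything a font does. The
names _forms and shape are hypothetical; real renderers work from font tables
and do far more than this.

```python
import unicodedata

def _forms(letter):
    """Presentation Forms-B entries for one base letter, keyed by form tag."""
    target = f"{ord(letter):04X}"
    found = {}
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[1] == target:
            found[parts[0].strip("<>")] = chr(cp)
    return found

def shape(word):
    """Pick a contextual form for each letter of a word in logical order."""
    tables = [_forms(ch) for ch in word]
    shaped = []
    for i, forms in enumerate(tables):
        # Joins backward if this letter has a final form and the previous
        # letter has an initial form (assumed to mean it connects forward).
        joins_prev = i > 0 and "final" in forms and "initial" in tables[i - 1]
        joins_next = (i + 1 < len(tables) and "initial" in forms
                      and "final" in tables[i + 1])
        if joins_prev and joins_next:
            key = "medial"
        elif joins_prev:
            key = "final"
        elif joins_next:
            key = "initial"
        else:
            key = "isolated"
        # Characters with no presentation forms pass through unchanged.
        shaped.append(forms.get(key, word[i]))
    return "".join(shaped)

# KAF, TEH, ALEF, BEH in logical (typing) order
for ch in shape("\u0643\u062A\u0627\u0628"):
    print(unicodedata.name(ch))
```

Because ALEF never connects forward, the BEH after it falls back to its
isolated form even though it ends the word; that is the kind of detail that
makes simple positional rules insufficient.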

In recent years, MS Windows has been able to support rendering Arabic by calls to
a DLL, specifically Uniscribe (usp10.dll). Although I'm no MS fan (nor a fanatical hater,
either), I do think that Uniscribe in its various versions has been a great
benefit to computer use of other than simple alphabetic LtoR writing systems.

I'm woefully ignorant of how Linux renders Arabic, but to my novice's eyes, it
seems to render acceptably. 

Again, I'd rather not say so, but if I understand correctly that Sugar is (now,
or to be) cross-platform, it would be a lot simpler (unfortunately) to let each
OS take care of "shaping and joining".

Perhaps the Opera and/or Firefox browser developers could advise Sugar on the
best way to handle RtoL writing systems. Arabic and Hebrew are the principal
ones, but Urdu (which uses Arabic script) also needs to be included. Far better
to have all the "software interfaces" in place for all important writing systems!

The "alphabets" of India and Southeast Asia are a different matter; afaik, most
or all are LtoR, but the letters and other graphic elements are often not simply
placed one after the other on the writing line. There are important symbols that
are not letters, and the physical placement of letters is not necessarily simple
nor even in sequence. A good number of writing systems are not technically
alphabets; they are technically abugidas. 

In an abugida, the basic letters stand largely for syllables: a consonant followed
by one particular vowel, usually [a], afaik. For instance, Gujarati (from India)
has KA, KHA, GA, GHA, NGA, CA, CHA, JA, JHA, [...] SA, HA, each with its own
corresponding "letter". That [a] is usually called the "inherent vowel". For other
sounds, added symbols indicate a different vowel. This scheme seems to be used by
most (possibly all?) writing systems in India, other than those that use the
Arabic script (if any!). (I'm raising my political "rabbit ears", hoping not to
offend.)
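
A short Python sketch may help show how an abugida syllable is stored. It
takes Gujarati KA and adds a vowel sign, then prints each code point's name
and general category from the standard unicodedata module. The helper name
describe is made up for this example.

```python
import unicodedata

def describe(text):
    """List (name, general category) for each code point in text."""
    return [(unicodedata.name(ch), unicodedata.category(ch)) for ch in text]

KA = "\u0A95"            # GUJARATI LETTER KA, carries the inherent vowel [a]
VOWEL_SIGN_I = "\u0ABF"  # GUJARATI VOWEL SIGN I, replaces the inherent vowel

# The syllable "ki" is stored as the consonant followed by the vowel sign.
for name, category in describe(KA + VOWEL_SIGN_I):
    print(category, name)
```

The vowel sign is a mark rather than a letter, and as far as I know this one
is drawn to the left of the consonant even though it is stored after it, so
the rendering software has to reorder what it draws.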

There are several styles of Arabic, although probably their distinctions are not
of primary concern to Sugar and the XO; calligraphy, though, is another matter.
(Some years ago, I had a look at Web sites devoted to Arabic calligraphy, and was
astonished to see the great variety, the amount, and often the great beauty of it
all. It was very inspiring. Recommended.)

However, Urdu might need to be taken into account; I just don't know enough. Urdu
is such an ornate and elaborate form of Arabic script (Nastaliq?) that until a
few years ago (as far as I know!), Urdu newspapers were prepared by calligraphers
who wrote the text onto [large cards, perhaps] that were then photographed to
make printing plates. I'm not at all sure of my info., but apparently it is now
possible to typeset acceptable Urdu by computer.

It would seem reasonable to expect an Urdu XO to offer a simpler, but
linguistically correct form of Urdu script. I must respectfully bow out of the
room and listen, though, because I just don't know.

On the subject of computer typesetting of Arabic, Scientific American magazine
published an excellent article about it (roughly 1992?). Proper typesetting of
Arabic cannot be done mechanically; it requires a computer.

Even the numerals are of modest concern; there are two different sets of symbols
in the Arabic-alphabetic world. Unicode 3.0 refers to them as Arabic-Indic and
Eastern Arabic-Indic. In particular, it seems that the symbols for 4 and 6 differ
considerably. See the Unicode ranges U+0660..U+0669 for Arabic-Indic, and
U+06F0..U+06F9 for Eastern Arabic-Indic.
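
For anyone who wants to compare the two digit sets, this Python sketch prints
both ranges and converts either set to ASCII digits. It uses only the standard
unicodedata module; to_ascii_digits is a hypothetical helper name. Note that
the formal character names for the second range say "EXTENDED ARABIC-INDIC".

```python
import unicodedata

ARABIC_INDIC = [chr(0x0660 + n) for n in range(10)]
EASTERN_ARABIC_INDIC = [chr(0x06F0 + n) for n in range(10)]

def to_ascii_digits(text):
    """Replace any Unicode decimal digit with its ASCII equivalent."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digits
        out.append(ch if value is None else str(value))
    return "".join(out)

# Print the two sets side by side with their formal names.
for a, e in zip(ARABIC_INDIC, EASTERN_ARABIC_INDIC):
    print(a, e, unicodedata.name(a), "/", unicodedata.name(e))

# Both forms of "46" map to the same ASCII string.
print(to_ascii_digits("\u0664\u0666"), to_ascii_digits("\u06F4\u06F6"))
```

Python's built-in int() also accepts either set directly, so parsing numbers
is the easy part; choosing which glyphs to display for a given locale is the
localization question.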

As well, the writing systems of India and SE Asia have their own forms of numerals.

One other matter is mixed-directional text, that is, text that includes some LtoR
as well as some RtoL content. Unicode.org took care of this several years ago
when it defined the Bidirectional Algorithm; that's been worked out. 
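
The Bidirectional Algorithm starts from a direction class assigned to every
character. The Python sketch below only looks up those classes and reports
whether a string mixes directions; it does not reorder anything, which is the
hard part and is best left to a rendering library. The names STRONG,
directions, and is_mixed are my own.

```python
import unicodedata

# Strong bidi classes: L (left-to-right), R (e.g. Hebrew), AL (Arabic letter)
STRONG = {"L": "ltr", "R": "rtl", "AL": "rtl"}

def directions(text):
    """Return the set of strong directions present in text."""
    classes = map(unicodedata.bidirectional, text)
    return {STRONG[c] for c in classes if c in STRONG}

def is_mixed(text):
    """True if text contains both LtoR and RtoL characters."""
    return len(directions(text)) > 1

sample = "XO \u0643\u062A\u0627\u0628"  # Latin "XO" followed by an Arabic word
for ch in sample:
    print(repr(ch), unicodedata.bidirectional(ch))
print(is_mixed(sample))
```

Digits and spaces carry weaker classes of their own, which is why the full
algorithm is a good deal more involved than this check.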

There has been some i18n effort in Linux, as probably most l10n subscribers know;
at least, one can Google on [Li18nux] (not L18nux, btw).

As I see it, fortunately, any OS (or family of them) that's "serious" has pretty
well settled down so it can properly support rendering Arabic; my semi-informed
guess is that Sugar for different OSes will need some individual code for each OS
to support the available rendering libraries.

Respectfully,

Nicholas Bodley
Waltham, Mass.
Sent from Speakeasy.net web mail
(I apologize for being utterly, totally inactive for the past month!)
Still in the early stages of clearing the e-mail backlog -- >3,000 messages
before de-spamming, and Speakeasy Web mail is quite disappointing.
