[Localization] Arabic Projects

Sat Aug 2 20:05:20 EDT 2008

On Wed, Jul 30, 2008 at 7:44 AM, Nicholas Bodley <nbodley at speakeasy.net> wrote:
>
> On Wed Jul 30  7:17 , "Walter Bender"  sent:
>
>>Welcome to the project.
>>
>>Localization is more than string translation. In particular, there is
>>work to be done to improve the RTL rendering in Sugar. Do you know of
>>any Python programmers who could help with this?

Specifically, we need programmers skilled in using Pango from Python,
and we need that usage documented somewhere.

Also, Pango has been integrated into Squeak, but nobody in that
community has worked on complex scripts AFAIK, so we need more people
with Smalltalk and Pango skills.

>>regards.
>>
>>-walter
>
> I'm really uncertain whether Sugar developers are aware of the complexity of
> rendering Arabic text acceptably. I'd love to be able to say that rendering (that
> is, displaying and printing) Arabic is easy; however, it is anything but easy.
> Surely, our native Arabic speakers are completely aware of this, but it might
> help if I explain something about what is involved, for the sake of those who
> don't yet know.
>
> First, a bit of personal background: I'm very interested in writing systems, but
> am strictly a dilettante/amateur in the field. Please do correct anything I'll
> say that is wrong!

I have written about Unicode, including RTL and Bidi, professionally,
but have not done development in this area.

> Our Latin (or latin, or Roman/roman)* alphabet is quite straightforward to write
> and typeset; it's simply a matter of placing letters

numbers, and punctuation

>  in LtoR order on the writing
> line.

and getting accents placed correctly on letters.

> This holds true of several other alphabets, such as Greek and Cyrillic, but
> those of India and Southeast Asia are not so simple. *Homework I've neglected to
> do! :)

The rudiments of rendering for Asian scripts are described in the
Unicode Standard, available in PDF on the unicode.org site.

> Hebrew is another RtoL writing system, but it's extremely simple when compared to
> Arabic.

First important point: Hebrew and Arabic are Bidi scripts, not just
RTL. Numbers in particular are written LTR within RTL text, and there
are other exceptions, which differ among languages. The essentials are
described in Unicode Standard Annex #9, Unicode Bidirectional
Algorithm. http://unicode.org/reports/tr9/. Some of the
language-specific details are handled in the Unicode Common Locale
Data Repository, http://unicode.org/cldr/.
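
To make the Bidi point concrete, here is a minimal Python sketch that
prints the bidirectional class the Unicode Character Database assigns
to a few characters. It uses only the standard unicodedata module; the
helper name bidi_classes and the sample string are mine, chosen for
illustration.

```python
import unicodedata


def bidi_classes(text):
    """Return (character name, bidi class) pairs, one per character."""
    return [
        (unicodedata.name(ch, "?"), unicodedata.bidirectional(ch))
        for ch in text
    ]


# Hebrew alef, Arabic beh, a European digit, an Arabic-Indic digit, Latin a
sample = "\u05D0\u0628" + "7" + "\u0664" + "a"

for name, cls in bidi_classes(sample):
    print(f"{cls:>2}  {name}")
```

Hebrew letters come out as R and Arabic letters as AL, while the digits
are EN (European number) or AN (Arabic number) rather than a strong RTL
class, which is why a run of digits keeps its LTR order inside RTL text.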

> Keeping in mind that probably almost all writing systems have a "typeset" variety
> as well as a cursive, flowing, handwritten variety, Arabic is rare in that when
> properly written, its nature, even when "typeset" (and when rendered by computer)
> is essentially cursive. Although trademarks and product labels can seem to be
> "typeset", nevertheless, afaik, the only way to acceptably render Arabic text is
> essentially according to cursive form, as if handwritten.

The rudiments are available in
http://www.unicode.org/versions/Unicode5.0.0/ch08.pdf
(Unicode Standard 5.0, Chapter 8, Middle Eastern Scripts)

> It is easy to position individual Arabic letters in RtoL sequence, leaving small
> spaces between letters, but the result, I'm essentially sure, looks very bad;

Absolutely.

> it
> would definitely not be acceptable in an XO! (Apparently Arabic typewriters
> created rather wretched-looking text; I'd love to know.)

Here is a random example, giving the same sequence of letters
separately and then joined.

ش س ي ب ل ا ت ن م ك
شسيبلاتنمك

> Arabic letters can have as many as four different forms for each letter, although
> (pretty sure!) not every letter has four different forms. More, soon, about this.

Correct.

> Furthermore, keeping in mind the cursive (from the Latin for "running", iirc)
> character of Arabic when properly rendered, consecutive letters are often joined.
>
> As to the four forms, one is "standalone", or isolated -- this is the form a
> letter takes when it's all by itself.

For example:

ي

> The other three forms are for the beginning and for the end of a word, as well as
> a third form used within a word.

The same letter as above, in all three joining forms (initial, medial,
and final).

ييي

> The names I recall for these forms are initial, medial, final, and isolated.
>
> Arabic text in computer form could simply specify each letter in sequence, with
> no regard to which of the four forms is to be used; however, to simplify the
> process of rendering somewhat, "combining forms" are offered in Unicode (or
> were!) -- see Arabic Presentation Forms, A and B,
> Unicode ranges U+FB50--U+FDFF (A) and U+FE70--U+FEFF. (This was Unicode 3; sorry
> if I mislead.)

The presentation forms are kept in Unicode only as compatibility
characters, and their use in stored text is discouraged. They greatly
complicate sorting and searching.
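
To illustrate the searching problem, here is a minimal Python sketch
using only the standard unicodedata module. It looks at the four
Presentation Forms-B code points for YEH and shows that text stored
with them does not match a plain-letter query until it is folded with
NFKC normalization. The function name fold_presentation_forms and the
sample strings are mine, for illustration only.

```python
import unicodedata

# Presentation Forms-B code points for YEH (U+064A):
# isolated, final, initial, medial
YEH_FORMS = ["\uFEF1", "\uFEF2", "\uFEF3", "\uFEF4"]

for ch in YEH_FORMS:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          unicodedata.decomposition(ch))


def fold_presentation_forms(text):
    """Map compatibility characters, including presentation forms,
    back to the ordinary letters (NFKC normalization)."""
    return unicodedata.normalize("NFKC", text)


stored = "\uFEF3\uFEF4\uFEF2"   # initial + medial + final YEH
query = "\u064A\u064A\u064A"    # three plain YEH, as a user would type

print(stored == query)                           # raw comparison
print(fold_presentation_forms(stored) == query)  # after folding
```

The raw comparison fails and the folded one succeeds, so any software
that accepts presentation forms has to normalize before it compares.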

> Apparently, these simplify the process of rendering decent Arabic, although
> (fairly sure) they are not a completely-acceptable solution.

This turns out to be obsolete information. Before Unicode, fonts (like
mechanical typewriters before them) did not support the rendering
functions needed to handle contextual forms. Now they do.

> =+=
>
> In general, the term seen for describing the composition of Arabic text for
> rendering from Unicode (or other coding) is "shaping and joining". As to shaping,
> I haven't learned about that; perhaps even three of the four forms sometimes need
> modifying before they join acceptably.

Shaping includes picking the correct contextual form. Joining can
include picking further glyph variations to join specific letter
pairs, and placing letters so that their joining points connect. In
many forms of Arabic, letters join on a slope, not on a flat baseline
as in the example I gave above.
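
As a rough illustration of the first step only (picking the contextual
form), here is a toy Python sketch. It is not how Pango or any real
shaper is written: the joining table is a small hand-picked subset that
I am assuming for the example, diacritics are ignored, and the names
joining_type and contextual_forms are mine.

```python
# Toy contextual-form selection for Arabic letters in logical order.
# Assumption: a hand-picked, incomplete joining table; real shapers use
# the Unicode joining data and skip over transparent marks (harakat).

# Letters that join only to the preceding letter (right-joining).
RIGHT_JOINING = set(
    "\u0622\u0623\u0624\u0625\u0627\u0629\u062F\u0630\u0631\u0632\u0648"
)
HAMZA = "\u0621"  # joins to nothing


def joining_type(ch):
    """Return 'R' (right-joining), 'D' (dual-joining) or 'U' (non-joining)."""
    if ch in RIGHT_JOINING:
        return "R"
    if ch != HAMZA and "\u0620" <= ch <= "\u064A":
        return "D"  # simplification: everything else in this range
    return "U"


def contextual_forms(word):
    """Name the form each letter of a logical-order word would take."""
    types = [joining_type(ch) for ch in word]
    forms = []
    for i, t in enumerate(types):
        # Connects backward only if the previous letter can connect forward.
        joins_prev = t != "U" and i > 0 and types[i - 1] == "D"
        # Connects forward only if this letter is dual-joining and the
        # next character is a joining letter at all.
        joins_next = t == "D" and i + 1 < len(types) and types[i + 1] != "U"
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


# kaf, teh, alef, beh: the word "kitab" (book)
print(contextual_forms("\u0643\u062A\u0627\u0628"))
```

Note that the last letter comes out isolated even though it ends the
word, because the alef before it does not connect forward. Choosing
ligatures and lining up the joining points is a further step that this
sketch does not attempt.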

> Joining, of course, refers to combining
> adjacent letters as if they were handwritten, with a continuous stroke. (Indeed,
> there are more details, only some of which I'm aware of ... thinking of tatweels,
> for instance.)
>
> In recent years, MS Windows has been able to support rendering Arabic by calls to
> a DLL, specifically Uniscribe.dll. Although I'm no MS fan (nor a fanatical hater,
> either), I do think that Uniscribe in its various versions has been a great
> benefit to computer use of other than simple alphabetic LtoR writing systems.
>
> I'm woefully ignorant of how Linux renders Arabic, but to my novice's eyes, it
> seems to render acceptably.

Pango is the usual rendering engine for Linux. Others include SIL
Graphite and Trolltech Scribe.

> Again, I'd rather not say so, but if I understand that Sugar is (now, or to be)
> cross-platform, it would be a lot simpler (unfortunately) to let the OS take care
> of "shaping and joining".

Yes.

> Perhaps the Opera and/or Firefox browser developers could advise Sugar on the
> best way to handle RtoL writing systems. Arabic and Hebrew are the principal
> ones, but Urdu (which uses Arabic script) also needs to be included. Far better
> to have all the "software interfaces" in place for all important writing systems!

Also Dari, Pashto, Sindhi, Farsi, Azeri, Kurdish, and historical forms
of Turkish, Hausa, Swahili, Indonesian, Uzbek, Tajik, and others.

> The "alphabets" of India and Southeast Asia are a different matter; afaik, most
> or all are LtoR, but the letters and other graphic elements are often not simply
> placed one after the other on the writing line. There are important symbols that
> are not letters, and the physical placement of letters is not necessarily simple
> nor even in sequence. A good number of writing systems are not technically
> alphabets; they are technically abugidas.
>
> Abugidas consist largely of syllables, consonants followed by one particular
> vowel, usually [a], afaik. For instance, Gujarati (from India) has KA, KHA, GA,
> GHA, NGA, CA, CHA, JA, JHA, [...] SA, HA. [A] is usually called the "inherent
> vowel". Of course, each of these has its own corresponding "letter". For other
> sounds, added symbols indicate a different vowel. This scheme seems to be used by
> most (possibly all?) writing systems in India, other than those that use the
> Arabic script (if any!). (I'm raising my political "rabbit ears", hoping not to
> offend.)
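
The inherent vowel is visible in the encoding itself. Here is a minimal
Python sketch using the standard unicodedata module and the Gujarati
letter KA; the constant names and the describe helper are mine, for
illustration.

```python
import unicodedata

KA = "\u0A95"            # read as "ka": consonant plus inherent vowel a
VOWEL_SIGN_I = "\u0ABF"  # dependent sign that replaces the inherent vowel
VIRAMA = "\u0ACD"        # sign that suppresses the inherent vowel


def describe(text):
    """List the Unicode character names that make up a string."""
    return [unicodedata.name(ch) for ch in text]


ki = KA + VOWEL_SIGN_I   # "ki": stored as consonant, then vowel sign
k = KA + VIRAMA          # bare "k" with no vowel

print(describe(KA))
print(describe(ki))
print(describe(k))
```

The vowel sign I is stored after the consonant but drawn to its left,
which is one example of glyphs not appearing in the order in which the
characters are stored.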
>
> There are several styles of Arabic, although probably their distinctions are not
> of primary concern to Sugar and the XO;

The larger distinctions, say between Kufic and Diwani, are important.
Qur'anic Arabic is important.

> calligraphy, though, is another matter.
> (Some years ago, I had a look at Web sites devoted to Arabic calligraphy, and was
> astonished to see the great variety, the amount, and often the great beauty of it
> all. It was very inspiring. Recommended.)
>
> However, Urdu might need to be taken into account; I just don't know enough. Urdu
> is such an ornate and elaborate form of Arabic script (Nastaliq?) that until a
> few years ago (as far as I know!), Urdu newspapers were prepared by calligraphers
> who wrote the text onto [large cards, perhaps] that were then photographed to
> make printing plates. I'm not at all sure of my info., but apparently it is now
> possible to typeset acceptable Urdu by computer.

Yes.

> It would seem reasonable to expect an Urdu XO to offer a simpler, but
> linguistically correct form of Urdu script. I must respectfully bow out of the
> room and listen, though, because I just don't know.
>
> Referring to computer typesetting of Arabic, Scientific American magazine
> published an excellent article (roughly 1992?) about computer typesetting of
> Arabic. Proper typesetting of Arabic cannot be done mechanically; it requires a
> computer.

As I recall, the first Arabic Linotype was installed in Egypt in 1912.
There is also a long history of Arabic typewriters and character-cell
terminals.

> Even the numerals are of modest concern; there are two different sets of symbols
> in the Arabic-alphabetic world. Unicode 3.0 refers to them as Arabic-Indic and
> Eastern Arabic-Indic. In particular, it seems that the symbols for 4 and 6 differ
> considerably. See the Unicode ranges U+0660..U+0669 for Arabic-Indic, and
> U+06F0..U+06F9 for Eastern Arabic-Indic.
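
Both digit sets carry ordinary decimal digit values in the Unicode
Character Database, so software can parse them without a lookup table
of its own. A minimal Python sketch, using the two ranges quoted above
and only the standard library (the list names are mine):

```python
import unicodedata

ARABIC_INDIC = [chr(cp) for cp in range(0x0660, 0x066A)]
EASTERN_ARABIC_INDIC = [chr(cp) for cp in range(0x06F0, 0x06FA)]

for a, e in zip(ARABIC_INDIC, EASTERN_ARABIC_INDIC):
    # Each row: digit value, then the code point of each variant.
    print(unicodedata.digit(a), f"U+{ord(a):04X}", f"U+{ord(e):04X}")

# Python's int() accepts any Unicode decimal digits.
forty_two_arabic = int("\u0664\u0662")
forty_two_eastern = int("\u06F4\u06F2")
print(forty_two_arabic, forty_two_eastern)
```

Parsing is therefore the easy part. Choosing which set to display for
a given language is a locale question, which is the kind of detail the
CLDR data covers.
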
>
> As well, the writing systems of India and SE Asia have their own forms of numerals.
>
> One other matter is mixed-directional text, that is, text that includes some LtoR
> as well as some RtoL content. Unicode.org took care of this several years ago
> when it defined the Bidirectional Algorithm; that's been worked out.

The Bidi Algorithm is fine for laying out paragraphs, but nowhere near
enough for Web pages and application menus.
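
One piece the application has to supply is a base direction for each
separate label or menu item. Here is a minimal Python sketch of the
usual first-strong heuristic, loosely following rules P2 and P3 of the
Bidi Algorithm. It is simplified: it ignores embedding and isolate
controls, and falling back to LTR when no strong character is found is
my assumption. The function name base_direction is mine.

```python
import unicodedata


def base_direction(text, default="ltr"):
    """Guess a label's base direction from its first strong character."""
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "ltr"
        if cls in ("R", "AL"):
            return "rtl"
    return default  # only digits, punctuation, or spaces


# Hypothetical menu labels: Arabic first, then Latin first, then digits.
print(base_direction("\u0645\u0644\u0641 (File)"))
print(base_direction("File \u0645\u0644\u0641"))
print(base_direction("2008"))
```

A toolkit still has to apply the result to every widget, mirror the
layout, and align the text, none of which the algorithm itself covers.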

> There has been some i18n effort in Linux, as probably most l10n subscribers know;
> at least, one can Google on [Li18nux] (not L18nux, btw).

Lots. There are numerous mailing lists for localization and Unicode
issues in different software: emacs, Yudit, Gnome, Perl, and so on.

> As I see it, fortunately, any OS (or family of them) that's "serious" has pretty
> well settled down so it can properly support rendering Arabic; my semi-informed
> guess is that Sugar for different OSes will need some individual code for each OS
> to support the available rendering libraries.

Yes, although we could write an interface that would be portable
between Pango, Uniscribe, and Apple ATSUI, at least, if the other
problems of porting Sugar to Mac and Windows can be solved.
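
To show the shape such an interface might take, here is a purely
hypothetical Python sketch. None of these class or method names exist
in Sugar, Pango, Uniscribe, or ATSUI; they are assumptions of mine, and
the stand-in backend does no real shaping.

```python
from abc import ABC, abstractmethod


class TextLayoutBackend(ABC):
    """Hypothetical facade over a platform text engine (names are mine)."""

    @abstractmethod
    def measure(self, text, font_name, font_size):
        """Return (width, height) in pixels of the shaped, laid-out text."""

    @abstractmethod
    def draw(self, text, font_name, font_size, x, y):
        """Shape and draw the text with its origin at (x, y)."""


class RecordingBackend(TextLayoutBackend):
    """Stand-in backend for tests: records calls, does no real shaping."""

    def __init__(self):
        self.calls = []

    def measure(self, text, font_name, font_size):
        # Placeholder metric only: one em per code point, no shaping.
        return (len(text) * font_size, font_size)

    def draw(self, text, font_name, font_size, x, y):
        self.calls.append((text, font_name, font_size, x, y))


def draw_label(backend, text, x, y):
    """Caller code depends only on the facade, not on a specific engine."""
    backend.draw(text, "Sans", 12, x, y)
    return backend.measure(text, "Sans", 12)


backend = RecordingBackend()
print(draw_label(backend, "\u0645\u0644\u0641", 0, 0))
```

A real backend would hand the logical-order string to the platform
engine and let it do the Bidi reordering, shaping, and joining, so that
none of that logic is duplicated in Sugar itself.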

> Respectfully,
>
> Nicholas Bodley
> Waltham, Mass.
> Sent from Speakeasy.net web mail
> (I apologize for being utterly, totally inactive for the past month!)
> Still in the early stages of clearing the e-mail backlog -- >3,000 messages
> before de-spamming, and Speakeasy Web mail is quite disappointing.
>
>>Localization mailing list
>>Localization at lists.laptop.org
>>http://lists.laptop.org/listinfo/localization

-- 
Silent Thunder [默雷/शब्दगर्ज] is my name,
And Children are my nation.
The whole world is my dwelling place,
And Truth my destination.