[Localization] Omitting format specifiers in plural form translations

Alexander Dupuy alex.dupuy at mac.com
Thu Aug 7 15:23:12 EDT 2008


In a message thread on the OLPC Localization mailing list:
> I wrote:
>   
>> Khaled - the problem with the "not all arguments converted" is due to  
>> missing %d in some of the plural forms of the Arabic translation (I  
>> don't know why this wasn't caught by the Pootle automated checks that  
>> Sayamindu runs - or maybe it would be caught by them, but you didn't run  
>> them before committing)
>>     

Khaled Hosny replied:
> I think this is valid for C at least (I used this many times with gnome),
> it may be a python specific issue.
>
>   

The "not all arguments converted" error is specific to Python.  As you 
note, a C application has no problem when a string's only format 
specifier is %d and the translation leaves that specifier out.
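
For anyone who wants to reproduce the Python failure, here is a minimal 
sketch.  The translated string is a made-up English stand-in for a 
plural form without %d, not the actual Arabic translation:

```python
# Hypothetical plural-form translation that leaves out the %d specifier.
translated = "a pair of years"

try:
    # The application still passes the count, exactly as for the English string.
    text = translated % 2
except TypeError as exc:
    # Python refuses to ignore the unused argument.
    text = None
    message = str(exc)
```

Here the % operator raises TypeError because the argument 2 is never 
consumed by the format string.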

However, in a string like the following "There are %d files in the %s 
directory" having a plural form translation "There are a pair of files 
in the %s directory" is likely to cause an application written in C to 
crash.  Using the positional format "There are a pair of files in the 
%2$s directory" might work in some cases, but I would not want to depend 
on it, since the printf documentation says:

> There may be no gaps in the numbers of arguments specified using '$'; 
> for example, if arguments 1 and 3 are specified, argument 2 must also 
> be specified somewhere in the format string.

I don't see any great solution for these sorts of strings; hopefully, 
they are rare, and in the few cases where they occur, the trick that you 
came up with for Python could be used, and might work on at least some 
systems.

>> note that the plural form 1 and plural form 2 do not have the %d marker.  
>> I suppose (I'm not an Arabic speaker) that these are something like "a  
>> year" and "a pair of years" - which may well be more idiomatic Arabic,  
>> but cause errors due to the lack of the %d parameter
>>     
> Putting the numbers will make a very bad translation, some thing like
> saying 2 day in English.
>   
>> The requirement for numeric substitutions in all plural forms is  
>> arguably a bug/misfeature of the GNU gettext / Python i18n/l10n system -  
>>     
>
> I think this a bug in python's gettext implementation, since this is
> allowed in C.
>   

The issue here is not with gettext itself - either in Python or in C, 
gettext does not interpret or replace the %d format specifier - in C, 
the substitution of %d is done by a call to one of the printf functions; 
in Python, the substitution is performed by the % string formatting 
operator.

An internationalized application that wants to print a localized version 
of "There are %d files in the %s directory" will do something like the 
following pseudo-code:

localized_string = ngettext(english_singular, english_plural, plural_number)
formatted_string = sprintf(localized_string, plural_number, directory_name)

This sequence is coded into the application.  Fixing it in gettext would 
require a new API that replaces these two calls with one integrated call 
(so that gettext calls printf itself to do the formatting), and every 
application that uses plural forms would have to be changed to use it.  
Short of that, I'm not sure that there is anything that can be done to 
gettext to fix this.
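
In Python, the same two-step pattern might look roughly like the sketch 
below.  The describe() helper is hypothetical, and 
gettext.NullTranslations stands in for a real message catalog, so the 
English strings come back unchanged:

```python
import gettext

# Assumption: no real catalog is loaded; NullTranslations simply returns
# the singular msgid for n == 1 and the plural msgid otherwise.
catalog = gettext.NullTranslations()

def describe(count, directory):
    """Hypothetical helper: look up the plural form, then format it."""
    # Step 1: gettext only selects a string; it never looks at %d or %s.
    template = catalog.ngettext(
        "There is %d file in the %s directory",
        "There are %d files in the %s directory",
        count,
    )
    # Step 2: the % operator does the substitution, and it requires every
    # supplied argument to be consumed by the selected string.
    return template % (count, directory)
```

The point of the sketch is that the argument tuple in step 2 is fixed 
in the application, whatever string the translator supplied in step 1.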

In theory, one could change the application code to scan the returned 
localized string to see if a %d format specifier has been omitted, but 
this is impractical, and application code is unlikely to properly handle 
all of the possible formatting flag variants (%'d for thousands 
separators, %Id for alternate-form digits, combinations of the above, 
etc. etc.).

>
> I tried %.d which I supposed it would suppress printing the number, but
> it made no difference, however %.s does the trick. Now I'm wondering how
> bad is that since msgfmt -c gives "fatal errors" but python didn't
> complain so far.
>   

Python is much more flexible than C when it comes to implicit type 
conversion, so it's quite reasonable to use %.s to print a zero-width 
representation of a number.  I would suggest using %.0s to make it more 
explicit that this is what you are doing and that it is intentional.  
It's also probably not a great idea to use %.0s for localizing C 
applications (although it works on my Fedora 7 system) since some 
implementations of printf may cause an application crash when formatting 
a numeric value as if it were a string (even if it is zero-width).
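
A small Python sketch of the placeholder trick follows.  The strings 
are hypothetical English stand-ins for the Arabic plural forms, keyed by 
the count they would be used for:

```python
# Hypothetical English stand-ins for the Arabic plural forms discussed here.
plural_forms = {
    1: "a year%.0s",
    2: "a pair of years%.0s",
    5: "%d years",
}

# The application formats every form the same way, always passing the count.
# %.0s converts the count with str() and truncates it to zero characters.
results = {count: template % count for count, template in plural_forms.items()}
```

The count is consumed in every case, so no exception is raised, but it 
only appears in the output for the form that uses %d.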

If the specific plural form is for a value of zero, %.0d is safer for C 
(and works for Python as well) since a zero-precision representation of 
0 is the empty string (for other values, as you point out, all necessary 
digits are still printed).  This doesn't help in your case, where the 
plural forms where you want to omit the number are for values of 1 and 2.

Since these work-arounds (as well as the option of omitting the format 
specifier, for C) are possible, safe, and at least in some cases, 
necessary, it probably makes sense to modify the msgfmt --check-format 
behavior, as well as the Pootle checks for format string consistency 
(implemented in the translate-toolkit, which is why I am also sending 
this e-mail to the translate-devel mailing list).

The msgfmt --check-format code for C format strings already allows a 
plural form to have fewer format specifiers, so that case is handled 
correctly.  The Python format parser also allows fewer unnamed format 
specifiers in plural forms, which is a mistake - as you found out, 
Python raises an exception when it is given more arguments than the 
format string consumes.  So 
gettext-0.17/gettext-tools/src/format-python.c:format_check() lines 
496-513 need to be revised so that regardless of the "strict" 
(equality) parameter, (spec1->unnamed_arg_count != 
spec2->unnamed_arg_count) causes an error.

On the other hand, if "strict" (equality) is not true, a 
(spec2->unnamed[i].type == FAT_PLACEHOLDER) should not cause a mismatch.  
FAT_PLACEHOLDER would be a new type that would be assigned by the 
format_parse() function for the format string %.0s.

For the translate-toolkit checks, filters/checks.py:printf() lines 
511-526 need to be revised so that if self.hasplural and 
match2.group('fullvar')=='.0s' it will be treated as a placeholder and 
not cause a failure (this change needs to be made on both sides of the 
if match2.group('ord'): test).

These changes would allow %.0s to be used as a placeholder when omitting 
format specifiers in plural form translations for Python applications, 
without triggering undesired errors from the msgfmt and 
translate-toolkit checking.
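
To make the proposed rule concrete, here is a rough standalone sketch 
in Python.  It is not the msgfmt or translate-toolkit code; the regular 
expression is a simplification that only handles unnamed specifiers, and 
the names unnamed_specs() and plural_form_ok() are invented for this 
example:

```python
import re

# Simplified matcher: either a literal "%%" or an unnamed specifier with
# optional flags, width, precision and length modifier.
SPEC = re.compile(r"%(?:%|[-+ #0]*\d*(?:\.\d*)?[hlL]?[diouxXeEfFgGcrs])")

def unnamed_specs(text):
    """Return the unnamed format specifiers in text, skipping literal %%."""
    return [m.group(0) for m in SPEC.finditer(text) if m.group(0) != "%%"]

def plural_form_ok(msgid_plural, msgstr_form):
    """Hypothetical check for one plural form of a Python format string."""
    source = unnamed_specs(msgid_plural)
    target = unnamed_specs(msgstr_form)
    # The argument counts must always match, or Python raises TypeError.
    if len(source) != len(target):
        return False
    # %.0s is accepted as a placeholder for any type; otherwise the
    # conversion characters must agree position by position.
    return all(t == "%.0s" or s[-1] == t[-1] for s, t in zip(source, target))
```

With this rule a form such as "a pair of years%.0s" passes against 
"%d years", while a form that drops the specifier entirely is still 
reported.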

@alex
-- 
mailto:alex.dupuy at mac.com


