An Agreeable Procrastination – and the blog of Niels Kühnel

Code, the universe and everything

The New Localization Framework in Umbraco 5

This is a primer on the new localization framework in Umbraco 5.

Microsoft did a very fine job with System.Globalization when it comes to formatting numbers and dates for different locales. The localization framework in Umbraco builds on this by adding support for grammatical differences between (spoken) languages, including differences in plural forms, word order in sentences, and so on.

Broadly speaking, the framework consists of:

  • A replacement for resource strings that allows texts to be combined from a multitude of layered sources.
  • A superset of the string.Format syntax with a domain specific template language tailored for handling grammatical differences between languages.

The main objective is to separate grammatical logic from code and to maximize the length of text passages to be localized to give translators maximum context and flexibility. All this while minimizing the number of redundant texts.

Let’s look at a simple example to illustrate why you need this framework. Suppose you want to greet the user in some system with the number of new messages like “Welcome Fletcher. You have 5 new messages”. You quickly realize that this doesn’t work with only 1 message and take a simple approach with

string.Format("Welcome {0}. You have {1} new message(s)", name, count)

That works for English but it’s not suitable for localization because other languages may not support the “word + (plural ending)” form very well. Besides, you probably don’t want your fancy Web 2.0 site to print messages that look like something from a DOS command prompt (y/n?).

Instead you might solve this with

string.Format("Welcome {0}. You have {1} new {2}", name, count, count == 1 ? "message" : "messages")

or

string.Format(Get("Greeting"), name, count, count == 1 ? Get("MessageSingular") : Get("MessagePlural"))

(This assumes that you have a Get method that returns resource strings.)

That will work great for most Western languages. You may however have cut off the French because they use the singular form for zero too (they have 0 message). And it will definitely not work for Slavic languages because they have much more complex rules for plural forms. See http://translate.sourceforge.net/wiki/l10n/pluralforms for reference (e.g. one of the Polish cases is n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20)).
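To make the difference concrete, here is a minimal C# sketch of plural form selection for English, French and Polish, using the rules quoted above. The class and method names (PluralRules, English, French, Polish) are made up for illustration and are not part of Umbraco; each method simply returns the index of the plural form to use, and counts are assumed to be non-negative.

```csharp
using System;

// Hypothetical helper, not part of Umbraco: returns the index of the
// plural form to use for a given non-negative count.
public static class PluralRules
{
    // English: form 0 (singular) only for exactly 1, otherwise form 1.
    public static int English(int n) => n == 1 ? 0 : 1;

    // French: form 0 (singular) for 0 and 1, otherwise form 1.
    public static int French(int n) => n > 1 ? 1 : 0;

    // Polish: three forms. Form 0 for 1, form 1 for counts ending in 2-4
    // (except 12-14), and form 2 for everything else.
    public static int Polish(int n)
    {
        if (n == 1) return 0;
        bool few = n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20);
        return few ? 1 : 2;
    }
}
```

With these rules 0 is singular in French but plural in English, and in Polish 22 uses the same form as 2 while 12 uses the same form as 5. This is exactly the kind of logic you don’t want scattered around your application code.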

Not supporting a lot of exotic languages may be okay for your intended audience but the approach still has some disadvantages:

  1. You have now included language specific grammatical logic in your code. It’s very annoying to fix bugs related to this and it may be a never-ending story as new languages are targeted.
  2. You have three different texts to translate for the same message and it’s not very clear in which context the atomic texts “message” and “messages” belong.
  3. You need to explain what {0}, {1} and {2} represent in the text and it becomes ghastly if you have even more parameters. That may entice you to split the text to reduce the number of parameters but then the translator will lose some of the flexibility.

In Umbraco 5 you’ll write

Localize("Greeting", new { Username = name, MessageCount = count })

From a developer’s perspective that’s perfect because the code above very clearly expresses “I want to have the greeting text here, and I’ll pass these named parameters that it needs”. From a translator’s perspective this is also great because the framework’s entire pattern syntax is available for creating a translation without any compromises.

The English version for this would be

Welcome {Username}. You have #MessageCount{1: 1 new message | {#} new messages}

The first thing you’ll notice is that named parameters are used instead of numbers. In code, anonymous types are used to specify the parameters, and in the text it’s clear what the parameters represent. It’s not needed in this example, but you can use the normal format specifiers after the name, so {MessageCount:N2} would become “5.00”.
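If you wonder how named parameters and format specifiers can be combined, here is a rough sketch. It is not the framework’s implementation, just an illustration under simple assumptions: the NamedFormat class is hypothetical, it reads the anonymous type’s properties with reflection, and it hands the specifier over to string.Format.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical illustration, not Umbraco's code.
public static class NamedFormat
{
    // Matches {Name} or {Name:FormatSpec}.
    private static readonly Regex Placeholder =
        new Regex(@"\{(?<name>\w+)(?::(?<spec>[^}]+))?\}");

    public static string Format(string template, object parameters, IFormatProvider culture)
    {
        return Placeholder.Replace(template, match =>
        {
            // Look the placeholder name up as a property on the parameter object.
            var property = parameters.GetType().GetProperty(match.Groups["name"].Value);
            if (property == null) return match.Value; // unknown names are left untouched

            object value = property.GetValue(parameters, null);
            string spec = match.Groups["spec"].Value;

            // Delegate to string.Format so standard specifiers such as N2 behave as usual.
            string composite = spec.Length > 0 ? "{0:" + spec + "}" : "{0}";
            return string.Format(culture, composite, value);
        });
    }
}
```

Calling NamedFormat.Format("Welcome {Username}. You have {MessageCount:N2} messages", new { Username = "Fletcher", MessageCount = 5 }, CultureInfo.InvariantCulture) gives “Welcome Fletcher. You have 5.00 messages”.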

The second thing is the “switch” construct that allows you to use different texts for different counts. It has the syntax

#[ParameterName] { [Condition 1] : Text 1 | [Condition 2] : Text 2 | Text in other cases }

Within the switch body the special parameter {#} means the value of the parameter being switched on.

This opens up some opportunities: without changing any code in your application, you could make the text more interesting by changing it to

You have #MessageCount{0: no new messages | 1: one new message | < 10: {#} new messages | a lot of messages!}

There’s a pretty extensive syntax for the switch conditions, and all the plural rules from the plural forms reference linked above are supported.
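To give a feeling for what happens with a switch, here is a toy evaluator for the body between the curly braces. It is only a sketch and far simpler than the real parser: the SwitchSketch class is made up for this post, it only understands plain numbers and the operators <, <=, > and >=, and it does not handle nested braces or case texts that themselves start with something that looks like a condition.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Toy evaluator for a switch body, for illustration only.
public static class SwitchSketch
{
    // A condition is a number with an optional comparison operator in front.
    private static readonly Regex Condition =
        new Regex(@"^\s*(?<op><=|>=|<|>)?\s*(?<num>\d+)\s*$");

    public static string Evaluate(string body, int value)
    {
        foreach (string rawCase in body.Split('|'))
        {
            string text = rawCase; // a case without a condition is the fallback text
            int colon = rawCase.IndexOf(':');
            if (colon >= 0)
            {
                Match m = Condition.Match(rawCase.Substring(0, colon));
                if (m.Success)
                {
                    int number = int.Parse(m.Groups["num"].Value, CultureInfo.InvariantCulture);
                    if (!Holds(m.Groups["op"].Value, value, number)) continue; // try next case
                    text = rawCase.Substring(colon + 1);
                }
            }
            // {#} stands for the value being switched on.
            return text.Trim().Replace("{#}", value.ToString(CultureInfo.InvariantCulture));
        }
        return string.Empty; // no case matched and no fallback was given
    }

    private static bool Holds(string op, int value, int number)
    {
        switch (op)
        {
            case "<": return value < number;
            case "<=": return value <= number;
            case ">": return value > number;
            case ">=": return value >= number;
            default: return value == number; // no operator means "equals"
        }
    }
}
```

Given the body “0: no new messages | 1: one new message | < 10: {#} new messages | a lot of messages!”, the sketch returns “no new messages” for 0, “5 new messages” for 5 and “a lot of messages!” for 42. The cases are tried from left to right and the first one that matches wins.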

Now, you may argue that translators don’t like writing curly braces but the syntax is just what the framework expects. Feel free to make your own easier-to-understand intermediate format for the translator or even create a graphical editor. The point is that grammatical logic is effectively removed from your application’s code.

(By the way, the framework is not locked to this syntax. It’s just the default parser. The framework works on ASTs and you can create your own grammar and parse that into these ASTs instead.)

Now let’s add one final thing to the example. Say we want some HTML in the text as we want it to be

“Welcome <span class='user-name'>Fletcher</span>. You have <span>5 new messages</span>”

With the string.Format approach you may either have to chop the text into tiny pieces or accept that the translations include markup. Neither is desirable. The former approach creates an immense number of texts with very unclear purposes and the latter makes you tired if you decide to change the markup after the software has been translated to 20 languages.

With the Umbraco localization framework you can write

Welcome <NameFormat: {Username}>. You have <MessageFormat: #MessageCount{…} >

And then dynamically specify the markup with

Localize("Greeting", new { Username = name, MessageCount = count, NameFormat = "<span class='user-name'>{#}</span>", MessageFormat = "<span>{#}</span>" })

This gives you at least two advantages: 1) you don’t have to split the text, so the translator still has full context and flexibility, and 2) the translated text is not tied to HTML and you could, in principle, use it on other devices because it just contains “format markers”.

With the default settings the parameter values are HTML encoded but the format you specify is not, so little Bobby <XSS would be greeted with “Welcome <span class='user-name'>Bobby &lt;XSS</span>”.
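The encoding behaviour can be sketched like this. It is an assumption about the general idea rather than the framework’s actual code: FormatMarkerSketch is a made-up name, and WebUtility.HtmlEncode from System.Net stands in for whatever encoder the framework uses. The value is encoded first and then dropped into the unencoded format.

```csharp
using System;
using System.Net;

// Illustration only: encode the parameter value, but not the format around it.
public static class FormatMarkerSketch
{
    public static string Apply(string format, string value)
    {
        // The value comes from the user and must be encoded.
        string encoded = WebUtility.HtmlEncode(value);

        // The format comes from the developer and is inserted as-is.
        return format.Replace("{#}", encoded);
    }
}
```

FormatMarkerSketch.Apply("<span class='user-name'>{#}</span>", "Bobby <XSS") returns “<span class='user-name'>Bobby &lt;XSS</span>”, so the markup you specified survives while the user’s input is made harmless.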

In later blog posts I’ll dive deeper into the syntax and its features, which include reusable templates, switching on timespans, Roman numerals and much more (you can see an up to date’ish specification of the grammar here), but my next post will be about text sources.

Text sources

My next blog post will be about how text sources are structured, the default XML format and how new sources can be implemented and embedded in assembly manifests. One of the main benefits over ordinary resource strings is that even if texts are embedded in assemblies, texts for other languages can be added from XML files, databases etc. These other sources can also replace/correct texts in existing languages. You’ll also see how texts can be arranged in namespaces to avoid clashes and how properties of MVC view models are automatically mapped to text keys without the use of attributes.
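Until then, the layering idea can be pictured with a small sketch. Everything in it is hypothetical (the LayeredTexts class, the “language/key” lookup string and the plain dictionaries standing in for real sources); it only shows how a later source can add a new language or correct a text from an earlier one.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of layered text sources, not the framework's API.
public sealed class LayeredTexts
{
    private readonly List<IDictionary<string, string>> _layers =
        new List<IDictionary<string, string>>();

    // Layers added later take precedence over layers added earlier.
    public void AddLayer(IDictionary<string, string> texts)
    {
        _layers.Add(texts);
    }

    // Returns the text from the most recently added layer that has it, or null.
    public string Get(string language, string key)
    {
        string lookup = language + "/" + key;
        for (int i = _layers.Count - 1; i >= 0; i--)
        {
            string text;
            if (_layers[i].TryGetValue(lookup, out text)) return text;
        }
        return null;
    }
}
```

If the first layer (say, texts embedded in an assembly) contains “en/Greeting” with a typo and a second layer (say, an XML file) contains a corrected “en/Greeting” plus a new “da/Greeting”, then Get("en", "Greeting") returns the corrected text and Get("da", "Greeting") returns the Danish one.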

Even if you don’t expect your application to be translated to other languages you can still benefit from the framework as it greatly helps you maintain your texts without hacking your code.

Remember: “Language is vivid. Don’t let computer languages keep it down!”

Written by niels.kuhnel

May 12, 2011 at 3:04 am
