Internationalization Getting Started

Bharadwaj Sridharan

Preface

Internationalization is the process of developing applications that can be made available in multiple locations and that support different languages and cultures.

This paper explores the design, development and testing of world-ready applications and provides an easy guide for developers to get started.

How this paper is structured

After a brief introduction to the history and different forms of Unicode, the paper describes the mechanisms of input and display of multilingual content from different sources. Subsequently, best practices for user interface design and writing culture aware code are explained. The paper ends with a note on different aspects of testing internationalized software.

Introduction

Internationalization is the process of developing software that works uniformly across multiple regions and cultures. There are two aspects to international software: World Readiness (Globalization) and Localization. World Readiness refers to the process of designing and coding the product so that it can be easily localized for different regions. Localization involves translating and customizing the product for different regions. A basic knowledge of character sets and encodings is essential for understanding the concepts of internationalization; hence this paper starts with an introduction to the evolution of character sets and Unicode.

Text, as we know, is a collection of characters. A character could be a letter, symbol, digit, etc. Each character can be represented using one or more numbers. The assignment of numbers to characters can be complicated. For example, the character 'a' can be represented using a single number. But consider a character that combines 'a' with two dots above it (ä): we can either assign a single number to the combination or represent it as two separate numbers, one for the character 'a' and another for the two dots. A collection of characters and the numbers assigned to them is called a character set. To define a character set, you first decide how many characters are required, set an upper limit for the numbers used for assignment and then assign a number to each character. The upper limit determines the number of bytes of memory required to store each character in the character set. That leads us to encoding. An encoding is the way a character is represented in a byte stream. The same character value can be encoded in multiple ways.
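The two ways of numbering an 'a' with two dots can be seen directly in .NET. The following is a minimal sketch, not part of the original paper; the class and method names (CombiningDemo, CodePoints) are chosen here for illustration. It uses Unicode normalization to switch between the single-number form and the two-number form.

```csharp
using System;
using System.Text;

public static class CombiningDemo
{
    // Lists the UTF-16 code units of a string as "U+XXXX" values.
    // For the characters used here, each code unit is one code point.
    public static string CodePoints(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string precomposed = "\u00E4"; // one number for the whole character
        // Form D splits the character into 'a' followed by a combining diaeresis.
        string decomposed = precomposed.Normalize(NormalizationForm.FormD);

        Console.WriteLine(CodePoints(precomposed)); // U+00E4
        Console.WriteLine(CodePoints(decomposed));  // U+0061 U+0308
    }
}
```

Both strings display as the same character, which is why text is usually normalized to one form before it is compared.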

There are multiple character sets in use today. The earlier character sets, which are still in use, like ASCII and EBCDIC have single byte encodings. That is, a single byte is used to represent a character. Double-byte character sets (DBCS) were developed to provide enough space for the thousands of ideographic characters in East Asian writing systems. Here, the encoding is still byte-based, but two bytes together represent a single character.

Unicode
Unicode was a brave effort to create a single character set that included every reasonable writing system. Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory. In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is dependent on the encoding used.

Every letter in every alphabet is assigned a code point by the Unicode Consortium, written like this: U+0645. The U+ means "Unicode" and the digits are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A is U+0041.

Assume we have a string:
Hello

This, in Unicode, corresponds to these five code points:

U+0048 U+0065 U+006C U+006C U+006F

Answering the question of how these letters are stored in memory leads us to the various encoding forms specified by the Unicode Consortium. UTF-8 and UTF-16 are the most commonly used.

UTF-8
UTF-8 is an encoding in which code points from 0 to 127 are stored in a single byte and code points above 127 are stored in two, three or four bytes. (The original design allowed sequences of up to six bytes; the current standard limits them to four.) The advantage of UTF-8 is that, for code points below 128, it is compatible with ASCII: plain English text is stored identically in ASCII and UTF-8. Specifically, Hello, which corresponds to the code points U+0048 U+0065 U+006C U+006C U+006F, is stored as the bytes {48, 65, 6C, 6C, 6F}, exactly as it would be stored under ASCII.

UTF-16
UTF-16 is an encoding based on 16-bit (two-byte) units. Code points up to U+FFFF are represented by a single 16-bit unit; code points above U+FFFF are represented by a pair of units called a surrogate pair. In the Big-Endian convention the most significant byte of each unit comes first; in the Little-Endian convention the least significant byte comes first. UTF-16 is often confused with the older UCS-2, which is limited to a single 16-bit unit per character and therefore cannot represent code points above U+FFFF.
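To make the byte layouts concrete, here is a small sketch that prints the bytes produced by the UTF-8 and UTF-16 encodings in .NET. The class and method names (EncodingDemo, Hex) are illustrative and not from the original paper. The last three lines use U+1F600, a code point above U+FFFF, which needs two 16-bit units in UTF-16 and four bytes in UTF-8.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Encodes text with the given encoding and returns the bytes as hex, e.g. "48-65".
    public static string Hex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        Console.WriteLine(Hex("Hello", Encoding.UTF8));             // 48-65-6C-6C-6F
        Console.WriteLine(Hex("Hello", Encoding.Unicode));          // 48-00-65-00-6C-00-6C-00-6F-00 (little-endian)
        Console.WriteLine(Hex("Hello", Encoding.BigEndianUnicode)); // 00-48-00-65-00-6C-00-6C-00-6F

        string supplementary = "\U0001F600";                        // one code point above U+FFFF
        Console.WriteLine(supplementary.Length);                    // 2 (a surrogate pair)
        Console.WriteLine(Hex(supplementary, Encoding.UTF8));       // F0-9F-98-80
        Console.WriteLine(Hex(supplementary, Encoding.BigEndianUnicode)); // D8-3D-DE-00
    }
}
```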

Choosing the right encoding
With various encoding options available for Unicode text, it is important to pick the encoding that suits our requirements. The following are the recommended practices for using Unicode content:

Choose UTF-16 as the fundamental representation of text in your application. The .NET Framework internally stores all strings in UTF-16 format.
Choose UTF-8 for application interoperability, for example for content that passes through browsers, networks or servers that were designed for single-byte text and cannot handle UTF-16.
Avoid processing character data byte by byte.
Do not use UTF-8 as a means of compressing text. Compared with UTF-16, it actually increases the size of the data for many languages; most East Asian characters, for example, take three bytes in UTF-8 instead of two.
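The size trade-off between the two encodings is easy to measure. The sketch below is illustrative only (the class and method names are assumptions, not part of the paper); it counts the bytes each encoding needs for the same text.

```csharp
using System;
using System.Text;

public static class EncodingSizeDemo
{
    // Number of bytes the text occupies in UTF-8.
    public static int Utf8Bytes(string s) { return Encoding.UTF8.GetByteCount(s); }

    // Number of bytes the text occupies in UTF-16.
    public static int Utf16Bytes(string s) { return Encoding.Unicode.GetByteCount(s); }

    public static void Main()
    {
        foreach (string text in new[] { "Hello", "Привет", "日本語" })
        {
            Console.WriteLine("{0}: UTF-8 = {1} bytes, UTF-16 = {2} bytes",
                text, Utf8Bytes(text), Utf16Bytes(text));
        }
        // Hello:  UTF-8 = 5,  UTF-16 = 10
        // Привет: UTF-8 = 12, UTF-16 = 12
        // 日本語: UTF-8 = 9,  UTF-16 = 6
    }
}
```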

Input

One of the challenges that English-language programmers face when learning to develop international software is entering text in different languages using their existing keyboard. Microsoft Windows provides Input Method Editors (IMEs) to enable multilingual input. An IME is software that allows input in multiple languages using a standard 101-key keyboard. An IME consists of both an engine that converts keystrokes into phonetic and ideographic characters and a dictionary of commonly used ideographic words. As the user enters keystrokes, the IME engine attempts to identify which character or characters the keystrokes should be converted into.

As an example, to enter text in Hindi (Indic), you would take the following steps:

Install an IME that supports the Hindi language.
From the Languages tab in the Regional and Language Options property sheet, add Hindi as a new input language. While selecting Hindi, select the newly installed IME for Hindi.
Open Notepad, switch to Hindi from the language bar and type in Hindi. Remember to select a font that supports the script (such as Arial Unicode MS) and save the file in either UTF-8 or Unicode (UTF-16) format.

Note that the default input language is assigned on a per-thread basis when the thread is created. Similarly, switching to a different input language is done on a per-thread basis. Thus two different applications can each have a different input language.

In the web scenario, Internet Explorer handles the rendering of multiple input languages as long as a Unicode encoding (either UTF-8 or UTF-16) is used for the HTML content.

IMEs also provide support for new input services such as voice recognition engines and hand-writing recognition engines.

Rendering

An Internationalized product should properly display all supported scripts in accordance with the linguistic characteristics associated with them. The characteristics include bi-directionality, character reordering, contextual shaping, combining characters and special rules in terms of word breaking, line breaking and text justification.

For achieving the above characteristics, the first and foremost requirement is that of a font which can support multiple scripts.

Fonts are the final form in which multilingual content is displayed to the user. When editing a multilingual document, the user should not be expected to select a different font for each script because the user might not know the suitable font to select.

OpenType fonts, developed jointly by Microsoft and Adobe, allow rich mapping between characters and glyphs, thus enabling support for ligatures, positional forms, alternates and other substitutions. Core OpenType fonts such as Tahoma and Arial contain glyphs for Western and Central European, Hebrew, Arabic, Greek, Turkish, Baltic, Cyrillic and Vietnamese scripts. Although these fonts do not contain East Asian scripts (including such huge character sets would hurt performance), they link to fonts that do.

User Interface

The user interface, on which the fonts get rendered, should support the different fonts and character widths as well as the varying spacing requirements introduced by translation. For example, it is a well known fact that German text occupies more space to convey the same information when compared to English. Thus the user interface should be adaptable to accommodate user messages in both languages.

The following are the recommended guidelines for user interface design:

Labels should be placed above text boxes rather than beside them so that wider text can be accommodated. If a label must sit beside a text box, leave enough room for the text to grow.
Dialog boxes may expand during localization. Leave additional space between the end of a message and the edge of the dialog box; allowing about 30% for expansion is recommended.
Avoid text in images and icons. Text in images is difficult to localize and requires the effort of a user interface designer.
Buttons and labels should be wide enough to accommodate longer text in non-English languages.
Images embedded within the resource files of forms should be moved out into separate image files.
Mirroring: Mirroring is the process of localizing the user interface to handle right-to-left (RTL) languages such as Arabic, Hebrew and Farsi. To give an application's UI a proper RTL look and feel, both the text and the UI elements need to be laid out from right to left once they are translated into RTL languages. A complete discussion of mirroring is outside the scope of this paper.
Graphical images: Graphics can be difficult and expensive to translate. Hence it is preferable to use graphics that are universally acceptable.

The following are some of the key user interface elements that need to be localized:
1. Menus
2. Messages
3. Dialog boxes
4. Images
5. Toolbars
6. Status bar.

Storage

Information retrieval and storage in multilingual applications warrant careful attention. The key storage media are files and databases.

File
When reading a file that contains Unicode characters, it is important to specify the encoding. If the encoding is not provided, a default may be used (UTF-8 in the case of the .NET Framework), and this can result in misinterpretation of the data. For example, if a text file is stored in UTF-16 without a byte order mark and we read it without specifying the encoding, .NET will interpret the UTF-16 bytes as UTF-8, resulting in unintelligible input. Note that ASCII and UTF-8 are interchangeable if the file contains only characters with code points below 128.
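The effect of reading with the wrong encoding can be reproduced with a temporary file. This is a minimal sketch under stated assumptions: the class and method names are illustrative, and the file is deliberately written as raw UTF-16 bytes without a byte order mark so that the reader has nothing to detect the encoding from.

```csharp
using System;
using System.IO;
using System.Text;

public static class FileEncodingDemo
{
    // Writes text as UTF-16 (little-endian, no byte order mark) to a temporary
    // file and reads it back using the encoding supplied by the caller.
    public static string WriteUtf16ThenRead(string text, Encoding readEncoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, Encoding.Unicode.GetBytes(text));
            return File.ReadAllText(path, readEncoding);
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        string original = "नमस्ते";

        // Reading with the encoding the file was written in returns the original text.
        Console.WriteLine(WriteUtf16ThenRead(original, Encoding.Unicode) == original); // True

        // Reading the same bytes as UTF-8 yields unrelated characters.
        Console.WriteLine(WriteUtf16ThenRead(original, Encoding.UTF8) == original);    // False
    }
}
```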

Database
To support Unicode characters in databases, you need to use the special data types defined for this purpose. For example, SQL Server defines nchar, nvarchar, ntext data types to allow you to store Unicode text. (The n prefix for these data types comes from the SQL-92 standard for National (Unicode) data types). Use of nchar, nvarchar, and ntext in SQL Server 2000 is the same as char, varchar, and text, respectively, except that:

Unicode supports a wider range of characters.
More space is needed to store Unicode characters.
The maximum size of nchar and nvarchar columns is 4,000 characters, not 8,000 characters like char and varchar.
Unicode constants are specified with a leading N: N'A Unicode string'. The same prefix is required when inserting Unicode values (for example, Russian text) into an nvarchar column; without it, the literal is treated as non-Unicode text.

Once you have defined a Unicode-compatible column in the database, inserting and retrieving multilingual data is the same as working with regular (ASCII) data. When migrating databases, take care in mapping source and target columns: if the source data type is nchar and the target data type is char, data can be lost during migration.

Globalization: Writing culture-aware code

In a single code base approach, the code has to be constructed to handle multiple cultures. The representation and interpretation of formatted data such as date, time and currency values vary from one culture to another. The simplest example is the date. In the United States the default standard is mm/dd/yyyy, but in countries such as India and the United Kingdom the standard date format is dd/mm/yyyy.

The following general guidelines are useful in writing globalization-ready code.

Currency formatting
Culture-aware currency formatting has to take the following elements into account:
- Currency symbol: The symbol can be placed either before or after the currency value. Moreover, it can be a predefined symbol like '$' or a series of letters like 'Rs'.
- Negative-amount display: The negative sign can be placed before the currency symbol, between the symbol and the digits, or after the digits.
  
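  // Requires: using System; using System.Globalization;
  int digits = 100;
  System.Threading.Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");
  // "C" formats the value as currency using the thread's current culture.
  string formattedCurrency = digits.ToString("C");
  Console.WriteLine(formattedCurrency);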
  
  The above example initializes the thread's culture to US region and English language. Hence, when formatting a number in currency format, the output will be '$100.00'. If the culture is set to "en-GB" instead of "en-US", the output would be '£100.00'.
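As a variation, the culture can be passed to the formatting call directly instead of changing the thread's culture. The sketch below is illustrative (the names CurrencyDemo and Format are assumptions). It prints the currency symbol together with a positive and a negative amount for a few cultures. No output is listed for the negative amounts because the pattern depends on the culture data shipped with the runtime and operating system.

```csharp
using System;
using System.Globalization;

public static class CurrencyDemo
{
    // Formats an amount as currency for a named culture without
    // modifying the current thread's culture.
    public static string Format(decimal amount, string cultureName)
    {
        return amount.ToString("C", CultureInfo.GetCultureInfo(cultureName));
    }

    public static void Main()
    {
        foreach (string name in new[] { "en-US", "en-GB", "de-DE", "hi-IN" })
        {
            NumberFormatInfo nfi = CultureInfo.GetCultureInfo(name).NumberFormat;
            Console.WriteLine("{0}: symbol={1} positive={2} negative={3}",
                name, nfi.CurrencySymbol, Format(100m, name), Format(-100m, name));
        }
    }
}
```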
String operations
Culture-aware string manipulation has to take the following operations into account:
- String casing
  In English, the first letter of the names of the days of the week is capitalized. In Russian, the names of the days start with a lowercase letter. In fact, capitalizing the word for 'Wednesday' changes the meaning to 'environment'. Moreover, some languages, such as Tamil and Hindi (Indic), do not have casing at all.
- String comparison and sorting
  Alphabetical order can vary among languages, so the sequencing in dictionaries and phone books varies accordingly. In Swedish, for example, some accented vowels sort after "Z", whereas in other European countries the same accented vowel comes right after the unaccented vowel. Languages outside the Latin script, such as some Asian languages, have several sort orders depending on phonetics, radical order, etc. When performing a comparison, it is important to know whether it should be locale-aware. For example, when comparing a value retrieved from the registry against another value, we need a culture-independent comparison. The same applies to application-specific settings that are usually stored in configuration files. The .NET Framework provides an Invariant Culture for these types of comparisons. The invariant culture is culture-insensitive: it is associated with the English language but not with any country or region.
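The distinction between culture-independent and culture-aware operations can be sketched as follows. The helper names (SettingEquals, CompareForUser) are hypothetical and chosen for this illustration. Expected output is shown only for the culture-independent calls; the Turkish and Swedish results depend on the culture data available to the runtime, so they are printed without a stated result.

```csharp
using System;
using System.Globalization;

public static class ComparisonDemo
{
    // Culture-independent, case-insensitive match for registry values,
    // configuration keys and similar non-linguistic data.
    public static bool SettingEquals(string a, string b)
    {
        return string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
    }

    // Culture-aware comparison for text that is shown to or sorted for the user.
    public static int CompareForUser(string a, string b, CultureInfo culture)
    {
        return string.Compare(a, b, culture, CompareOptions.None);
    }

    public static void Main()
    {
        Console.WriteLine(SettingEquals("FILE", "file")); // True
        Console.WriteLine("file".ToUpperInvariant());      // FILE

        // Culture-sensitive casing: Turkish distinguishes dotted and dotless i.
        Console.WriteLine("file".ToUpper(new CultureInfo("tr-TR")));

        // Culture-sensitive ordering: the sign of the result tells which string sorts first.
        Console.WriteLine(CompareForUser("z", "ä", new CultureInfo("sv-SE")));
        Console.WriteLine(CompareForUser("z", "ä", new CultureInfo("de-DE")));
    }
}
```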
Number formatting
Numbers are represented in different formats in different regions. When handling and formatting numbers, the following aspects should be taken care of:
- The character used as a thousands separator
  In the United States, this character is a comma (for example, 10,000,000). In Germany, it is a period (for example, 10.000.000).
- The character used as the decimal separator
  In India, this character is a period (.). In Germany, it is a comma (,).
  
  Consider the following example, which illustrates the above two concepts. An integer value is first parsed from a string containing numeric data formatted as per US culture. The same integer value is then printed as a string in German culture. The output is 1.000.000,00. Notice the change in the decimal separator and the thousands separator.
  
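  // Requires: using System; using System.Globalization;
  string str = "1,000,000.00";
  // Parse using US conventions: ',' groups thousands and '.' marks the decimal point.
  int digits = int.Parse(str, NumberStyles.Number, new CultureInfo("en-US"));
  // Format using German conventions.
  Console.WriteLine(digits.ToString("N", new CultureInfo("de-DE")));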
- The way negative numbers are displayed.
  Negative numbers can be represented with the negative sign appearing before the digits or with a trailing negative sign.
- The placement of percentage sign (%)
  Similarly, the percentage sign can be placed before (%100) or after the digits (100%).
- Digit grouping
  Digit grouping refers to the way in which the digits of a number are separated into groups. For example, in India, the rightmost three digits form one group and the remaining digits are grouped in twos (for example, 1,00,00,000).
  
  The user can define preferred number-formatting parameters by making selections from the Numbers tab of the Customize Regional Options property sheet, within the Regional And Language Options property sheet.
  
  In the .NET Framework, the number format settings can be retrieved through the NumberFormat property of the current thread's CurrentCulture.
  
  For example, assume that the user has selected (1.1) as the negative number format. This means that the user writes negative numbers enclosed in parentheses.
  
  The following example parses a negative number entered by the user in a culture-aware manner. After parsing, it prints the number in German format. The output is -100,00.
  
  string str = "(100)";
  // NumberStyles.Any includes AllowParentheses, so "(100)" is read as -100.
  int digits = int.Parse(str, NumberStyles.Any,
      System.Threading.Thread.CurrentThread.CurrentCulture.NumberFormat);
  Console.WriteLine(digits.ToString("N", new CultureInfo("de-DE")));
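Digit grouping, described in the list above, is controlled in .NET by the NumberGroupSizes property of NumberFormatInfo. The sketch below sets the group sizes explicitly on a copy of the invariant format so that the result does not depend on installed culture data. The class and method names are illustrative, and a real application would normally take the settings from the user's culture instead.

```csharp
using System;
using System.Globalization;

public static class GroupingDemo
{
    // Formats a whole number using explicit group sizes, listed from right
    // to left; the last size is repeated for the remaining digits.
    public static string FormatGrouped(long value, int[] groupSizes)
    {
        var nfi = (NumberFormatInfo)CultureInfo.InvariantCulture.NumberFormat.Clone();
        nfi.NumberGroupSizes = groupSizes;
        return value.ToString("N0", nfi);
    }

    public static void Main()
    {
        Console.WriteLine(FormatGrouped(10000000, new[] { 3 }));    // 10,000,000
        Console.WriteLine(FormatGrouped(10000000, new[] { 3, 2 })); // 1,00,00,000
    }
}
```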
Date and Time formatting
Date formatting varies throughout the world. Although each culture has a long and a short date format, the way they are represented differs. The names of the months, the character used to separate days, months and years, and the placement of day and month relative to each other all vary from one culture to another. Even the number of digits used to represent the year can vary.

Consider the following code snippet, which tries to parse a date string in 'dd/mm/yyyy' format. The date string could have been read from a file or from a database.

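// Requires: using System; using System.Globalization;
System.Threading.Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");
// Throws FormatException: en-US expects month/day/year, and 15 is not a valid month.
DateTime dt = System.DateTime.Parse("15/02/2006");
Console.WriteLine(dt.ToLongDateString());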

The current culture is set to 'en-US'. Hence, when a date string is parsed, the expected format is 'mm/dd/yyyy'. The above code fails because it encounters a date in 'dd/mm/yyyy' format, which is invalid for this culture (there is no fifteenth month). When parsing date strings, it is important to know the source of the data. For example, if we know that the date string comes from a Russian or Indian culture, we can provide that information to the parsing function as outlined below.

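System.Threading.Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");
// Parse using the culture the data came from (day/month/year order).
DateTime dt = System.DateTime.Parse("15/02/2006", new CultureInfo("hi-IN"));
// Display still follows the thread's current culture (en-US).
Console.WriteLine(dt.ToLongDateString());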
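When the exact layout of the stored date is known, an alternative to naming a culture is to state the format explicitly. The following sketch, with illustrative names, uses DateTime.ParseExact with the invariant culture so that the result does not depend on the thread's culture. It is offered as one possible approach, not as the paper's own method.

```csharp
using System;
using System.Globalization;

public static class DateParsingDemo
{
    // Parses a date known to be in day/month/year order,
    // independent of the current thread's culture.
    public static DateTime ParseDayMonthYear(string text)
    {
        return DateTime.ParseExact(text, "dd/MM/yyyy", CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        DateTime dt = ParseDayMonthYear("15/02/2006");
        Console.WriteLine(dt.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture)); // 2006-02-15

        // TryParse reports failure instead of throwing when the culture does not match.
        DateTime ignored;
        bool parsedAsUs = DateTime.TryParse("15/02/2006", new CultureInfo("en-US"),
                                            DateTimeStyles.None, out ignored);
        Console.WriteLine(parsedAsUs); // False
    }
}
```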

Similarly, time formats vary based on the use of a 12-hour or 24-hour clock, the character used to separate hours, minutes and seconds, the use of time zones, etc.
Addresses and telephone numbers
Address formats differ across various parts of the world. Some regions may have a 'State'. The number of digits used in postal code will vary. For example, in India, a pin code is represented by 6 digits whereas in France it is 5 digits. Even the name used to represent the code is different. In some places it is known as PIN code whereas in many other places, it is known as postal code.

Similarly, the telephone numbers vary from one region to another. There is no standard format for a telephone number. The differences include the number of digits, the separating characters, the presence/absence of country/state codes, etc.
Units of measure
Units of measure also differ among countries. Even within a single country, multiple units may be used. The units of measurement that vary relate to the following items:
- Lengths
- Weights
- Area
- Volume
- Temperature
- Paper sizes
- Angle notation
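For the measurement system itself, .NET exposes a simple flag on RegionInfo. The sketch below (illustrative names) queries whether a region uses the metric system. It covers only that single flag and assumes the runtime has regional data available; the other items in the list, such as paper sizes, are not covered by this property.

```csharp
using System;
using System.Globalization;

public static class MeasurementDemo
{
    // Returns true if the region identified by a two-letter ISO code
    // uses the metric system.
    public static bool UsesMetric(string regionCode)
    {
        return new RegionInfo(regionCode).IsMetric;
    }

    public static void Main()
    {
        foreach (string code in new[] { "US", "IN", "DE", "FR" })
        {
            Console.WriteLine("{0}: metric = {1}", code, UsesMetric(code));
        }
        // US reports False; IN, DE and FR report True.
    }
}
```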

Localization Considerations

An Internationalized application can either have multiple binaries each targeting a separate locale or a single, world ready binary which supports multiple locales. The preferred approach is to go for a single, world ready functional core whose features and code design are developed for the input, display and output of a defined set of Unicode-supported language scripts and data related to specific locales. Among the many advantages offered by a single binary are simpler software updates and a single source code and development team. Moreover, all language versions can be shipped simultaneously.

Having decided upon a single world-ready binary, the preferred mechanism for localizing the application is to create satellite assemblies. A satellite assembly is a separate DLL, one per language, that contains the resources translated into that language. Using satellite assemblies makes it much simpler to add support for a new language without affecting the existing languages.

White labeling
Another important aspect of designing for internationalization is to allow the user interface to be customized for the region where the product will be sold. Additionally, the product may need to integrate with other products belonging to the company. In that case, it is essential that the international product takes on the same look and feel as the parent product. The company logo, background color and background images for dialog boxes can vary based on the location in which the product is sold. Similarly, the product logo and the overall look and feel can differ depending on whether the product is sold independently or along with another product. As an example, products like Winamp allow users to change skins. Each skin gives the product a unique look and feel.

Testing

There are two basic testing requirements that ensure that the product works well in all regions for which it was developed:

Testing to check that the product is locale-aware (Globalization)
Testing the localized versions of the product

The following diagram depicts the breakdown of the product with respect to testing for internationalization.

Testing for localized content
Localized content includes both text and graphic artifacts on the user interface. Textual content to be localized includes the static text on controls such as menus and buttons, as well as user-specific messages. Language experts are required for testing the localized content in each language that the product supports.

Testing for culture awareness
User locale settings define the number, date and currency formatting. Input locales should be handled correctly, and the user should be allowed to enter text in the corresponding language.

In many cases data needs to be converted from one encoding to another. Correct handling and understanding of encodings is required in such cases; if coded incorrectly, the conversion can result in loss of data. For example, converting from a multibyte character set to a single-byte or double-byte character set can result in data loss if not handled properly. User locales do not affect the conversion of data from one encoding to another.
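A test for this kind of data loss can be as simple as a round trip through the target encoding. The sketch below uses illustrative names and takes ASCII as an example of a narrower target encoding. It shows how unrepresentable characters are silently replaced by default, and how a strict encoder can be used to surface the problem instead.

```csharp
using System;
using System.Text;

public static class LossyConversionDemo
{
    // Converts text to the target encoding and back, returning what survives.
    public static string RoundTrip(string text, Encoding target)
    {
        return target.GetString(target.GetBytes(text));
    }

    public static void Main()
    {
        string text = "Grüße";
        Console.WriteLine(RoundTrip(text, Encoding.UTF8));  // Grüße
        Console.WriteLine(RoundTrip(text, Encoding.ASCII)); // Gr??e (characters lost)

        // A strict encoder throws instead of substituting '?'.
        Encoding strictAscii = Encoding.GetEncoding("us-ascii",
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        try
        {
            strictAscii.GetBytes(text);
        }
        catch (EncoderFallbackException)
        {
            Console.WriteLine("Text cannot be represented in ASCII");
        }
    }
}
```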

Feature based testing
The product may be marketed such that some features are either applicable or not applicable to specific cultures. While testing, it is important to ensure that these features are turned on/off as required while switching locales.
