Planning World-Ready Applications
Developing world-ready applications requires focused attention to a variety of issues beginning in the application design phase. In addition, you need to determine the extent of world-readiness your application will support.
Your first step in the process of developing a world-ready application is globalization. A globalized application can correctly accept, process, and display a worldwide assortment of scripts, data formats, and languages. However, while your globalized application may possess such flexibility, the language of the user interface remains unchanged. That is, you have not localized the application for another culture/locale.
An intermediate step prior to localization is a process known as localizability. Localizability is about ensuring you have enabled a globalized application for localization by separating the resources requiring localization from the rest of the application. Proper localizability results in source code you will not have to modify during localization.
The final step, localization, is the process of customizing your application for a given culture/locale. Localization consists primarily of translating the user interface.
If you address globalization, localizability, and localization requirements during the design phase, you will maximize the quality of the localized applications and minimize the amount of required time and money. On the other hand, retrofitting existing applications for localization typically results in inferior localized versions, increased cost, and increased time to market.
Overview of Globalization and Localization
In the past, the term localization often referred to a process that began after an application developer compiled the source files in the original language. Another team then began the process of reworking the source files for use in another language. The original language, for example, might be English, and the second language might be German. That approach, however, is prohibitively expensive and results in inconsistencies among versions. It has even caused some customers to purchase the original-language version instead of waiting months for the localized version. A more cost-effective and functional model divides the process of developing world-ready applications into three distinct parts: globalization, localizability, and localization.
The primary advantages of designing and implementing your application to be sensitive and appropriate to regional conventions, data in a variety of world languages, and alternate formats are:
- You can launch your application onto the market more rapidly. No additional development is necessary to localize an application once the initial version is complete.
- You use resources more efficiently. Implementing world-readiness as part of the original development process requires fewer development and testing resources than if you add the support after the initial development work starts. Furthermore, if you add world-readiness to your finished application, you might make it less stable, compounding problems that you could have resolved earlier.
- Your application is easier to maintain. If you build the localized version of your application from the same set of sources as the original version, only isolated modules need localization. Consequently, it is easier and less expensive to maintain code while including world-readiness. The key to this aspect of designing software rests in using resource files for the localized versions of the application.
Globalization, the first part of this process, involves:
- Identifying the cultures/locales that must be supported
- Designing features that support those cultures/locales
- Writing code that functions equally well in any of the supported cultures/locales
In other words, globalization adds support for input, display, and output of a defined set of language scripts that relate to specific geographic areas. The most efficient way to globalize these functions is to use the concept of cultures/locales. A culture/locale is a set of rules and a set of data that are specific to a given language and geographic area. These rules and data include information on:
- Character classification
- Date and time formatting
- Numeric, currency, weight, and measure conventions
- Sorting rules
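The rules and data above can be pictured as a small per-culture table. The sketch below is illustrative only; real applications should query the platform (for example, the Win32 GetLocaleInfo function) rather than hard-coding conventions like these, and the two cultures shown are merely examples:

```c
#include <stdio.h>
#include <string.h>

/* Illustrative culture/locale data. Real applications should query
   the underlying platform for these conventions instead of storing
   them in the application. */
typedef struct {
    const char *name;        /* culture/locale identifier       */
    char decimal_sep;        /* decimal separator               */
    char group_sep;          /* thousands (grouping) separator  */
    const char *date_order;  /* "MDY" or "DMY" short-date order */
} CultureInfo;

static const CultureInfo cultures[] = {
    { "en-US", '.', ',', "MDY" },
    { "de-DE", ',', '.', "DMY" },
};

/* Format the sample amount 1234.5 using one culture's conventions. */
static void format_amount(const CultureInfo *c, char *out, size_t n)
{
    snprintf(out, n, "1%c234%c5", c->group_sep, c->decimal_sep);
}

/* Format a short date using one culture's day/month ordering. */
static void format_date(const CultureInfo *c, int y, int m, int d,
                        char *out, size_t n)
{
    if (strcmp(c->date_order, "MDY") == 0)
        snprintf(out, n, "%02d/%02d/%04d", m, d, y);
    else
        snprintf(out, n, "%02d.%02d.%04d", d, m, y);
}
```

Keeping such conventions in data rather than in logic is what lets a single code path serve every culture/locale.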
Localizability is an intermediate process for verifying that a globalized application is ready for localization. In an ideal situation, this is only a quality assurance phase. If you designed and developed your application with an eye towards localization, this phase will primarily consist of localizability testing. Otherwise, it is during this phase that you will discover and fix errors in source code that preclude localization. Localizability helps ensure that localization will not introduce any functional defects into the application.
Localizability is also the process of preparing an application for localization. An application prepared for localization has two conceptual blocks: a data block and a code block. The data block contains all the user-interface string resources. The code block contains only the application code, which applies to all cultures/locales.
In theory, you can develop a localized version of your application by changing only the data block. The code block for all cultures/locales should be the same. The combination of the data block with the code block produces a localized version of your application. The keys to successful world-ready software design and subsequent localization are:
- Separation of the code block from the data block
- Application’s ability to accurately read data regardless of the culture/locale
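A minimal sketch of this separation might look like the following. The table, culture names, and string identifiers are hypothetical; in practice the data block would live in resource files or satellite DLLs rather than in source code:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical "data block": every UI string lives in this table,
   keyed by culture/locale name and string identifier. */
enum { IDS_GREETING, IDS_FAREWELL };

typedef struct {
    const char *culture;
    const char *strings[2];
} StringTable;

static const StringTable tables[] = {
    { "en-US", { "Hello",     "Goodbye"         } },
    { "de-DE", { "Guten Tag", "Auf Wiedersehen" } },
};

/* "Code block": a culture-neutral lookup with no embedded UI text.
   Falls back to the first table when the culture is unknown. */
static const char *load_string(const char *culture, int id)
{
    for (size_t i = 0; i < sizeof tables / sizeof tables[0]; i++)
        if (strcmp(tables[i].culture, culture) == 0)
            return tables[i].strings[id];
    return tables[0].strings[id];
}
```

Producing another localized version then means supplying a new string table; the lookup code never changes.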
Once localizability is complete, your application is ready for localization.
Localization is the process of adapting a globalized application, which you have already processed for localizability, to a particular culture/locale. The process of localizing your application also requires a basic understanding of relevant character sets commonly used in modern software development and an understanding of the issues associated with them. Although all computers store text as numbers (codes), different systems can (and do) store the same text using different numbers. In a general sense, this issue has never been more important than in this era of networks and distributed computing.
The localization process refers to translating the application user interface (UI) or adapting graphics for a specific culture/locale. The localization process can also include translating any help content associated with the application. Most localization teams use specialized tools that aid in the localization process by recycling translations of recurring text and resizing application UI elements to accommodate localized text and graphics.
Localization is the process of customizing your application for a given culture/locale. Localization consists primarily of translating the user interface. Proper planning will help ensure your application is localized in a timely and cost-effective manner.
- Determine the working language set — Prior to beginning the localization process, you need to know which languages and cultures you will support in your application. You might need to add features to support localization or locale-specific features for your target markets. For more information, see Globalization.
- Test your application for globalization issues — The goal of globalization testing is to detect potential problems in application design that could prevent your application from functioning correctly under different cultures/locales. It verifies that the code can handle all supported international input without breaking functionality, losing data, or causing display problems. Globalization testing checks proper functionality of the product with any of the culture/locale settings using every type of international input possible. For more information, see Globalization Testing.
- Create resource-only libraries (DLLs) — The greatest aid to localization is to separate localizable resources from application source code. Separating these resources from source code eliminates the need to recompile source code. For more information, see Isolating Localizable Resources.
- Test your application for localizability issues — Localizability testing verifies that you can easily translate the user interface of the program to any target language without re-engineering or modifying code. Localizability testing catches bugs normally found during product localization, so some form of localization of the program is required to complete this test. As such, localizability testing is essentially a hybrid of globalization testing and localization testing. Successful completion of localizability testing indicates that the product is ready for localization. You can use pseudo-localization to avoid the time and expense of true localization; pseudo-localization is perhaps the most cost-effective way of finding localizability bugs. For more information, see Localizability Testing.
- Prepare help content — When localizing your application, you should also localize the accompanying help content. Prior to localizing help content, there are steps you can take to reduce the expense of localization. For more information, see Preparing Help Content for Localization.
- Find a localization vendor — After preparing your application for localization, the next challenge is to locate a localization vendor. Microsoft provides help locating a localization vendor. For more information, see the Localization Partners section of the Visual Studio .NET Partner Resources Site (http://msdn.microsoft.com/vstudio/partners/default.asp).
- Determine the localization tool — Most companies use localization tools to translate strings between two languages. Localization tools provide the advantage of translation memories that recycle previously translated strings. For example, if you need to update your software, a translation tool allows you to update it without losing your previous translations. Two general types of translation tools are available — those that work on compiled binary files and those that process text or source code files. If your application has a complex UI that requires a visual designer to resize it, tools that work on the compiled binary files are generally more efficient. Tools that support binary localization do not require you to recompile your source files to develop localized resource (or satellite) DLLs. If your application translation only consists of strings, and visual resizing is not required, then a tool that works on text or source files might be sufficient. When using an external vendor for localization, your vendor might use a preferred localization tool.
- Recycle text from translation memories — If your application or a similar application has been previously localized, you can use a localization tool to import previous translations for recycling. This could reduce the cost of localization by reducing the number of strings a localizer needs to translate.
Isolating Localizable Resources
The greatest aid to localization is to separate localizable resources from application source code. Separating these resources from source code eliminates the need to recompile source code. However, resources that do not require localization should be kept separate from those that do. There are four categories of resources:
- User Interface (UI) – Resources that can usually be localized without any loss of functionality.
- Product adaptation – Resources that must be adapted for the target market, such as currency and date formats. If possible, design the application to query the underlying platform for this information instead of storing it within the application. For more information, see Formatting Issues.
- Debug – Resources that should not be localized since it is unlikely that you will ship a debug version of your application.
- Functional – Resources such as strings that cannot be localized without a loss of functionality.
Preparing the User Interface for Localization
Not only must content that appears in the user interface (UI) be localized, but also the UI itself must be capable of displaying the content for each localization instance. Here are some considerations:
- Size the UI to accommodate the largest localized version of the content.
- Do not mingle strings with controls, such as placing a text box in the middle of a sentence. Doing so would require the localization vendor to modify the UI to accommodate grammatical differences that cause sentence structures to change.
- Avoid hiding or overlapping UI controls with other UI controls. Some localization tools are not able to display each state of the UI to identify conflicts with displaying localized controls. Also, adjusting the layout of layered controls is more difficult than adjusting the layout of controls that are not layered.
- Avoid placing button text in a string variable. Doing so might prevent the localization vendor from localizing the string in the appropriate context because they will not be aware of which button the string appears on at run time. Instead, place button text in a property for the button.
- Avoid culture-specific images. A common example of this mistake from earlier UIs is the use of the rural mailbox found in the United States as an icon for mail. This type of mailbox is unfamiliar to some cultures outside of the United States.
- Avoid showing flesh, body parts, or gestures. Exposure of some body parts in one culture might not be acceptable in another. Also, using hand gestures can present problems since an innocent hand gesture in one culture can have an offensive interpretation in another.
- Beware of gender-specific roles and stereotypes in other cultures. The roles for men and women vary across cultures. Also, the portrayed ethnicity or race of an individual can also present problems. If displaying a graphic showing people, it is safer to use one that does not indicate a particular sex, race, or ethnicity.
- Avoid religious preferences. As with race and ethnicity, be very cautious about employing the use of religious symbolism. Some symbols can be innocuous in some cultures and sacrilegious in others.
- Avoid political symbols. In some markets, your application might first require government approval. Avoid graphics such as flags or currency and exercise caution when including maps that include disputed political boundaries or contentious location names.
- Avoid text in graphics. Graphics that include embedded text are expensive and time-consuming to localize, as they generally require the localization vendor to manually edit the graphic.
Preparing Help Content for Localization
- Keep the writing style used in help content simple. Most localization vendors charge by the word and complex sentence structures are more difficult to translate.
- Follow basic writing style principles, such as using consistent terminology.
- Respect cultural and local sensitivity. Create content that does not incorporate slang, jargon, colloquial expressions, or culture-specific metaphors.
- Write for easy recycling and reduced localization costs. For most localization vendors, reusing common sentences results in reduced localization costs since the sentence is only translated once.
- Respect cultural sensitivity in art and multimedia. As with the UI considerations above, help content should not use art or multimedia that does not have global meaning.
- Design the help system with global functionality. The help system software should be designed with the same global considerations as the application it supports.
Globalization and Localization Issues
Bidirectional (Bidi) is the term used to describe text that has scripts that flow both left-to-right (LTR) and right-to-left (RTL). Text that consists of a mixture of English and Arabic is a good example.
There are several issues you must keep in mind when making sure your application is Bidi-aware.
- Internal Data Storage — As mentioned above, Bidi text has LTR and RTL flowing scripts. Although both scripts flow differently, both are stored in the same order from first character to the last character. The best way to envision this is to think of the data stored from the top of a buffer to the bottom.
- Display Stream — Most Latin-based languages are displayed one character at a time. Bidi text breaks this display formula: character position determines how the script flows, and Arabic ligatures change their shape depending on the preceding and following characters. It is therefore best to save the currently displayed line in a buffer and then output the whole buffer every time you modify or add a character in the line.
- Line Length — Because of the ligature changes mentioned in the bullet above, it is not a good practice to sum cached character lengths to calculate the length of a line.
A built-in feature of ASCII is that you can create the lowercase and uppercase character of each letter in the English alphabet by adding 0x0020 to, or subtracting it from, the corresponding code point:
A[0x0041] + 0x0020 = a[0x0061]
Therefore, converting to either of the cases was a simple addition or subtraction algorithm:
if ((c >= 'a') && (c <= 'z')) upper = c - 0x0020;
This is not the case for accented Latin characters (A[U+0102], a[U+0103]). You cannot just add or subtract the same value to or from all characters to get their corresponding upper- and lowercase representation.
There are several other reasons why algorithmic solutions for case handling do not cover all occurrences.
- Some languages do not have a one-to-one mapping between upper- and lowercase characters. For example:
· European French accented characters lose their accents in uppercase (é becomes E). However, French-Canadian accented characters keep their accents (é becomes É).
· The uppercase equivalent of the German ß is SS.
- Most non-Latin scripts, such as Chinese, Japanese, and Thai, do not even use the concept of lower- and uppercase.
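Because of these irregularities, case conversion must be driven by mapping data rather than arithmetic. The fragment below is a sketch with a hand-picked, illustrative mapping table; production code should use the platform's case-mapping services (for example, the Win32 LCMapString function) instead:

```c
#include <stddef.h>
#include <string.h>

/* Hand-picked, illustrative case mappings (UTF-8 strings). Note that
   the German sharp s maps to TWO letters, so the result can be longer
   than the input; no add/subtract trick can express that. */
typedef struct { const char *lower; const char *upper; } CaseMap;

static const CaseMap case_map[] = {
    { "a",      "A"      },
    { "\u00e9", "\u00c9" },  /* e-acute keeps its accent (French-Canadian) */
    { "\u00df", "SS"     },  /* German sharp s expands to SS */
};

/* Return the uppercase form of one lowercase "character", or the
   input itself when this sketch has no mapping for it. */
static const char *upper_of(const char *lower)
{
    for (size_t i = 0; i < sizeof case_map / sizeof case_map[0]; i++)
        if (strcmp(case_map[i].lower, lower) == 0)
            return case_map[i].upper;
    return lower;
}
```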
A code page is a list of selected character codes (characters represented as code points) in a certain order. Code pages are usually defined to support specific languages or groups of languages that share common writing systems. All Windows single-byte code pages contain 256 code points. The first 128 code points represent the same characters in most code pages, which provides continuity and supports legacy code. It is the upper 128 code points (128-255, 0-based) where code pages differ considerably.
For example, code page 1253 provides character codes that are required in the Greek writing system, while code page 1252 provides the characters for Western European Latin writing systems, including English, German, and French. It is the upper 128 code points that contain either the accented characters or the Greek characters. Consequently, you cannot store Greek and German in the same code stream unless you include some type of identifier that indicates the referenced code page.
Because Chinese, Japanese, and Korean contain more than 256 characters, a different scheme, still based on the concept of 256-code-point code pages, had to be developed. The result was the Double-Byte Character Set (DBCS).
In DBCS, a pair of code points (a double byte) represents each character. Certain code points are reserved as lead bytes and are meaningful only when immediately followed by a defined second (trail) byte, so code had to treat these pairs of code points as one character. DBCS still disallows the combination of two languages, for example Japanese and Chinese, in the same data stream, because the same double-byte code points represent different characters depending on the code page.
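The ambiguity that code pages create can be seen in a tiny sketch. The table below covers only two bytes of two single-byte code pages and is purely illustrative; real conversions should use a complete facility such as the Win32 MultiByteToWideChar function:

```c
/* Partial, illustrative byte-to-Unicode tables for two code pages.
   The same byte value means a different character in each page. */
static unsigned to_unicode(int codepage, unsigned char b)
{
    if (codepage == 1252) {            /* Western European Latin */
        if (b == 0xE1) return 0x00E1;  /* a-acute  */
        if (b == 0xE4) return 0x00E4;  /* a-umlaut */
    } else if (codepage == 1253) {     /* Greek */
        if (b == 0xE1) return 0x03B1;  /* alpha */
        if (b == 0xE4) return 0x03B4;  /* delta */
    }
    return 0xFFFD;                     /* not covered by this sketch */
}
```

Without the code-page identifier, the byte 0xE4 alone cannot tell you whether the text is German or Greek.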
The special processing required by a complex script can involve one or more of the following characteristics: character reordering; contextual shaping; display of combining characters and diacritics; specialized word break and justification rules; cursor positioning; filtering out illegal character combinations. Scripts considered complex are Arabic, Hebrew, Thai, Vietnamese, and the Indic family.
It is important to respect these following points:
- When displaying typed text, do not output characters one at a time.
- To allocate character/glyph buffers, do not assume one character equals one glyph.
- To measure line lengths, do not sum cached character widths.
Windows has the ability to select an appropriate font to display a particular script. Windows accomplishes this by using a new face name called MS Shell Dlg. MS Shell Dlg is a mapping mechanism that makes it possible for Windows to support cultures/locales that have characters that are not contained in code page 1252. It is not a font, but is instead a face name for a nonexistent font. The MS Shell Dlg face name maps to the default shell font associated with the current culture/locale. For example, in U.S. English Windows 98 this maps to MS Sans Serif. However, in Greek Windows 98, this maps to MS Sans Serif Greek. In U.S. English Windows 2000, it maps to Tahoma. However, MS Shell Dlg does not work on East Asian versions of Windows 9x. For more information, see Localization and the Shell Font.
However, application developers often overlook fonts when creating world-ready applications. Here are two issues that you must watch when dealing with fonts:
- Hard-Coded Font Names — With the use of Unicode, we now deal with thousands of different characters instead of hundreds. Most fonts do not cover all of the Unicode character set. Thus if you hard code a font name that displays English characters and not Japanese, all of your localized Japanese text will display incorrectly. Another reason not to hardcode font names is that the font you want may not be on the system that is displaying your text.
- Hard-Coded Font Sizes — Some scripts are more complex than others. They need more pixels to be displayed properly. For example, most English characters can be displayed on a 5×7 grid, but Japanese characters need at least a 16×16 grid to be clearly seen. Whereas Chinese needs a 24×24 grid, Thai only needs 8 pixels for width but at least 22 pixels for height. Thus, it is easy to understand that some characters may not be legible at a small font size.
The best way to treat font names and sizes is to consider them as another localizable resource. Using MS Shell Dlg solves the problem of running your (any language) application on (any language) Windows NT/Windows 2000. Setting your font as a localizable resource solves the problem of making it possible for your localizer to change the font for the localized UI.
Input Method Editors (IMEs), also called front-end processors, are applets that make it possible for the user to enter the thousands of different characters used in East Asian written languages using a standard 101-key keyboard.
The user composes each character in one of several ways: by radical, by phonetic representation, or by typing in the character’s numeric code page index. IMEs are widely available; Windows ships with IMEs based on the most popular input methods used in each target area.
An IME consists of an engine that converts keystrokes into phonetic and ideographic characters plus a dictionary of commonly used ideographic words. As the user enters keystrokes, the IME engine attempts to convert the keystrokes into an ideographic character or characters.
Because many ideographs have identical pronunciation, the IME engine’s first guess is not always correct. When the suggestion is incorrect, the user can choose from a list of homophones; the homophone that the user selects then becomes the IME engine’s first guess the next time around.
You do not need to use a localized keyboard to enter ideographic characters. While localized keyboards can generate phonetic syllables (such as kana or hangul) directly, the user can represent phonetic syllables using Latin characters.
In Japanese, romaji are Latin characters representing kana. Japanese keyboards contain extra keys that make it possible for the user to toggle between entering romaji and entering kana. If you are using a non-Japanese keyboard, you need to type in romaji to generate kana.
There are three discrete levels of IME support for applications running on Windows: no support, partial support, and fully customized support. Applications can customize IME support in small ways — by repositioning windows, for example — or they can completely change the look of the IME user interface.
- No Support — IME-unaware applications ignore all IME-specific Windows messages. Most applications that target single-byte languages are IME-unaware. Applications that are IME-unaware inherit the default user interface of the active IME through a predefined global class, appropriately called IME. For each thread, Windows automatically creates a window based on the IME global class; all IME-unaware windows of the thread share this default IME window.
- Partial Support — IME-aware applications can create their own IME windows instead of relying on the system default. Applications that contain partial support for IMEs can use these functions to set the style and the position of the IME user interface windows, but the IME DLL is still responsible for drawing them — the general appearance of the IME’s user interface remains unchanged.
- Full Support — In contrast, fully IME-aware applications take over responsibility for painting the IME windows (the status, composition, and candidate windows) from the IME DLL. Such applications can fully customize the appearance of these windows, including determining their screen position and selecting which fonts and font styles are used to display characters in them. This is especially convenient and effective for word processing and similar programs whose primary function is text manipulation and which therefore benefit from smooth interaction with IMEs, creating a "natural" interface with the user.
For more information, see Input Method Editor.
Line-breaking and word-wrapping algorithms are important to text parsing as well as to text display. Western languages typically follow patterns that break lines on hyphenation rules or word boundaries and that break words based on white space (spaces, tabs, end-of-line, punctuation, and so on).
However, the rules for Asian DBCS languages are quite different from the rules for Western languages. For example, unlike most Western written languages, Chinese, Japanese, Korean, and Thai do not necessarily distinguish one word from the next word by using a space. The Thai language does not even use punctuation.
For these languages, world-ready software applications cannot conveniently base line breaks and word-wrapping algorithms on a space character or on standard hyphenation rules. They must follow different guidelines.
For example, the kinsoku rule determines Japanese line breaking — you can break lines between any two characters with several exceptions:
- A line of text cannot end with any leading characters — such as opening quotation marks, opening parentheses, and currency signs — that should not be separated from succeeding characters.
- A line of text cannot begin with any following characters — such as closing quotation marks, closing parentheses, and punctuation marks — that you should not separate from preceding characters.
- Certain overflow characters (punctuation characters) can extend beyond the right margin for horizontal text or below the bottom margin for vertical text.
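A line-breaking routine therefore needs per-script rules in addition to white-space scanning. The sketch below checks a proposed break point against a few kinsoku characters; the character lists are small illustrative fragments, not the complete kinsoku tables:

```c
#include <stddef.h>

/* Is code point c in the given set? */
static int is_in(unsigned c, const unsigned *set, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == c)
            return 1;
    return 0;
}

/* Sketch of a kinsoku check: may a line break fall between the
   characters prev and next? */
static int can_break_between(unsigned prev, unsigned next)
{
    /* leading characters that must not end a line (fragment) */
    static const unsigned no_line_end[] = {
        0x300C, /* left corner bracket  */
        0xFF08, /* fullwidth left paren */
    };
    /* following characters that must not begin a line (fragment) */
    static const unsigned no_line_start[] = {
        0x300D, /* right corner bracket  */
        0xFF09, /* fullwidth right paren */
        0x3001, /* ideographic comma     */
        0x3002, /* ideographic full stop */
    };

    if (is_in(prev, no_line_end, sizeof no_line_end / sizeof no_line_end[0]))
        return 0;
    if (is_in(next, no_line_start, sizeof no_line_start / sizeof no_line_start[0]))
        return 0;
    return 1;
}
```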
Keyboard layouts change according to culture/locale. Some characters do not exist in all keyboard layouts. When assigning shortcut-key combinations, make sure that you can reproduce them using international keyboards, especially if you plan to use the shortcut-key combinations with the Windows 2000 MUI (Multilanguage User Interface).
Because each culture/locale may use a different keyboard, consider using numbers and function keys (F4, F5, and so on) instead of letters in shortcut-key combinations.
Although you do not need to localize number and function-key combinations, they are not as intuitive for the user as letter combinations. Also, some shortcut keys may not work with every keyboard layout in a particular culture/locale; some cultures/locales, such as those in Eastern Europe and most Arabic-speaking countries/regions, use more than one keyboard layout.
For Right-To-Left (RTL) languages, not only does the text alignment and text reading order go from right to left, but also the UI layout should follow this natural direction. Of course, this layout change would only apply to localized RTL languages.
Note The .NET Framework does not support mirroring.
Arabic and Hebrew Windows 98 introduced the mirroring technology to resolve the issues with flipping the UI. Windows 2000 uses this same technology. It gives a perfect RTL look and feel to the UI. For Windows 98, this technology is only available on localized Arabic and Hebrew operating systems. However, on Windows 2000 and later, all versions of the operating system are mirroring-aware, making it possible for you to easily create a mirrored application.
To avoid confusion around coordinates, try to replace the concept of left/right with the concept of near/far. Mirroring is in fact nothing more than a coordinate transformation:
- Origin (0,0) is in the upper RIGHT corner of a window
- X scale factor = -1 (i.e., values of X increase from right to left)
The following figure illustrates the coordinate transformation from LTR to RTL:
To minimize the amount of re-write needed for applications to support mirroring, system components, such as "GDI" and "User," have been modified to turn mirroring on and off with almost no additional code changes except for a few considerations regarding owner-drawn controls and bitmaps.
For more information, see Window Layout and Mirroring in Window Features.
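In near/far terms, the transformation can be sketched as a single helper. The function below is illustrative; when a window uses the WS_EX_LAYOUTRTL extended style, the system performs this mapping for you:

```c
/* Mirroring as a coordinate transformation. In a mirrored (RTL)
   window the origin moves to the upper-right corner and x runs right
   to left, so a control placed at a given "near" offset stays near
   the leading edge in both layouts. */
static int near_to_client_x(int near_offset, int window_width,
                            int control_width, int mirrored)
{
    if (mirrored)
        return window_width - near_offset - control_width;
    return near_offset;
}
```

A control 50 units wide placed 10 units from the near edge of a 200-unit window sits at x = 10 in LTR layout and at x = 140 in RTL layout.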
All applications at some time process data, whether text or numerical. In the past, different culture/locale language requirements meant that applications used diverse encodings to represent this data internally. These encodings have caused fragmented code bases for operating systems and applications (single-byte editions for European languages, double-byte editions for East Asia languages, and bi-directional editions for Middle East Languages). This fragmentation has made it hard to share data and even harder to support a multilingual UI.
Since a goal of globalization is writing code that functions equally well in any of the supported cultures/locales, a data encoding schema that makes it possible for the unique representation of each character in all the required cultures/locales for our products is essential. Unicode meets this requirement.
Unicode makes it possible to store different languages in the same data stream. This one encoding can represent 64,000+ characters. With the introduction of surrogates, it can represent 1,000,000+ characters. The use of Unicode in Windows makes it easier to create world-ready code because you no longer need to reference a code page or group code points to represent one character.
Unicode is a 16-bit international character encoding that covers values for over 45,000 characters (with room for over a million more). Unicode text is usually easier to process than text in other encodings. It also eliminates the need to keep track of which characters are encoded and the need to keep track of the encoding schema that produced the characters.
Note A Unicode-enabled product is still not fully world-ready. In fact, enabling your code to use Unicode is probably only 10 percent of the work.
Using Unicode encoding to represent all international characters enables Windows 2000 to support over 64 scripts and hundreds of languages.
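As an illustration of how surrogates extend the 16-bit design, the sketch below splits a supplementary code point (U+10000 through U+10FFFF) into a UTF-16 surrogate pair; UTF-16 storage is assumed here, which is what Windows 2000 uses internally:

```c
/* Encode one supplementary-plane code point as a UTF-16 surrogate
   pair: subtract 0x10000, then split the remaining 20 bits. */
static void to_surrogate_pair(unsigned long cp,
                              unsigned *high, unsigned *low)
{
    cp -= 0x10000;                            /* 20 bits remain */
    *high = 0xD800 + (unsigned)(cp >> 10);    /* top 10 bits    */
    *low  = 0xDC00 + (unsigned)(cp & 0x3FF);  /* bottom 10 bits */
}
```

For example, U+10400 (Deseret capital long I) encodes as the pair 0xD801 0xDC00.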