My 11/4/07 Missoulian column
Although English is the default language of the Internet, it obviously isn’t the most spoken language in the world, and Unicode is helping to make the Internet – and computers in general – usable for millions of people in their own language.
Unlike many encoding schemes used in the past, Unicode is a standard that allows many different languages to be displayed in a consistent fashion in word processing documents or on Web pages.
One look at the Google News site will give you an idea of how far the use of world languages on personal computers has evolved with Unicode: At the bottom of the page are links to localized Google News sites in French, German, Italian, simplified and traditional Chinese, Korean, Arabic and more.
Fifteen years ago, it wasn’t that easy – you needed to install fonts and keyboard maps and sometimes hardware, even for the different encodings of European languages. If you didn’t, your Web browser or document would show gibberish instead of the language you wanted to read.
Because of Unicode, extensive support for dozens of European and Asian languages is now available out of the box for word processors, Web browsers and the operating system itself on Macintosh-, Windows- and Linux-based computers.
Unicode works by assigning the same “codepoint” to each letter, character or symbol in every supported language, across all platforms. If a Web site is encoded in Unicode, newer browsers automatically detect the encoding and display the text correctly. Unicode supports many mathematical and technical symbols, too, and is backward compatible with earlier encoding systems.
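For readers who want to see this concretely, here is a small illustration in Python (not part of the original column) of what “the same codepoint across all platforms” means in practice:

```python
# Every character has one fixed codepoint, no matter the platform or font.
# 'é' is always U+00E9, whether on a French Google News page or in a document.
assert ord("é") == 0xE9

# The same holds far beyond the Latin alphabet:
assert ord("א") == 0x05D0   # HEBREW LETTER ALEF
assert ord("ا") == 0x0627   # ARABIC LETTER ALEF
assert ord("中") == 0x4E2D  # a CJK character

# Backward compatibility: the first 128 codepoints are plain ASCII,
# so "A" is 65 in ASCII and U+0041 in Unicode alike.
assert ord("A") == 65
```

Because the codepoint, not the font, identifies the character, any Unicode-aware program on any operating system agrees on what the text says.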
Unicode works almost transparently, which is great for the user, but that very transparency makes it hard to describe.
Unicode isn’t a font in your word processor, though it supplies the information – the codepoints – that fonts use to display characters correctly. Nor is Unicode a program: it doesn’t need to be started, stopped or installed separately.
Unicode and its codepoints are standards – much like the HTML that handles Web pages or SMTP that handles e-mail – that determine how other parts of computers and the Internet work.
Language implementation on personal computers was once much more chaotic.
Years ago, I worked in a refugee camp in India, helping to implement Tibetan and Tibetanized Sanskrit on computers. On IBM-compatible computers (this was before Windows 95 came out), it took special hardware to handle the subscript and superscript characters in command-line DOS and in the first widely used version of Windows, 3.1. This was because IBM-compatible computers at the time were English-centric – the ASCII encoding scheme had room for only 128 characters, which is enough for English and a few other European languages, but not for Tibetan, Hindi and other complex scripts.
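A quick sketch in Python (again, an illustration added here, not from the column) shows why those 128 slots were never going to be enough:

```python
# ASCII is a 7-bit scheme: it can only address codepoints 0 through 127.
ascii_limit = 128

# Tibetan letters live in the Unicode block U+0F00-U+0FFF,
# far beyond anything ASCII could represent.
ka = "ཀ"                       # TIBETAN LETTER KA, U+0F40
assert ord(ka) == 0x0F40
assert ord(ka) >= ascii_limit

# Devanagari, used for Hindi and Sanskrit, is likewise out of reach:
a = "अ"                        # DEVANAGARI LETTER A, U+0905
assert ord(a) == 0x0905
assert ord(a) >= ascii_limit
```

Unicode’s much larger codepoint space is what lets these scripts coexist with English in the same document, with no special hardware at all.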
In the late 1980s, the quickly growing computer industry decided it was time to unify language encodings to ease usage. Along with hardware changes that allowed more room for encoding, Unicode began to be standardized, and the first version was released in 1991. The newest version, Unicode 5.0, can handle right-to-left languages such as Hebrew, and scripts with complex ligatures such as Arabic, Tibetan, the many languages of the Indian subcontinent and many other Asian scripts.
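That right-to-left handling isn’t magic: each Unicode codepoint carries a directionality property that tells software which way to lay the character out. A brief Python illustration (my example, not the column’s):

```python
import unicodedata

# Each codepoint has a bidirectional category in the Unicode database.
assert unicodedata.bidirectional("A") == "L"    # Latin: left-to-right
assert unicodedata.bidirectional("א") == "R"    # Hebrew: right-to-left
assert unicodedata.bidirectional("ا") == "AL"   # Arabic letter: right-to-left
```

Software that follows these properties can mix Hebrew, Arabic and English on one line and still render each run in its proper direction.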
Beyond Google News, language education has greatly benefited from Unicode. If I wanted to go back to studying Chinese as I did as an undergraduate, I wouldn’t have to fight with older font and keyboard systems. Unicode, of course, won’t make the work of memorizing characters any easier, and learning to pronounce the four tones still requires a human teacher. But easier access to resources for reading, writing and listening makes the language itself more accessible.
Unicode is also very important for language preservation. By the end of this century, some linguists say, half of the world’s roughly 6,900 languages will be extinct. Hundreds of Native and First Nation languages are already classified as endangered, with few fluent speakers.
Implementing a new or endangered language on a computer – designing the fonts and keyboard maps – is still very involved and in some cases takes years. But with Unicode, the use and dissemination of that language can progress faster.
Sometimes endangered languages are protected by not being fully entered into Unicode. A language can be considered intellectual property, and cultures sometimes seek to protect their language – and the culture embedded within it – by staying with older, font-based systems. That way, distribution of the language can be controlled while newer software tools are used to teach it.
Unicode has come a long way, but it isn’t yet a done deal. It can’t overcome the problem of missing fonts, and complex politics are sometimes involved in implementing new languages. And providing Unicode is one thing – using it is another. The adoption of Unicode in e-mail has been slow, and just last month, for example, testing began on Web site domain names in Arabic, Japanese and other complex scripts enabled by Unicode.
However, the admirable efforts of the thousands of people and organizations in the Unicode Consortium are making the world’s languages work transparently for us.