The Magic of Unicode

The Internet is a wonderful place that brings people across the world together into a virtual environment where they can communicate with each other freely. Never before in history has it been so easy to get to know one another and appreciate each other’s languages and cultures. One of the primary reasons for the popularity of the Internet is that it is easy to use, so many people readily adopt it and make it part of their lives.

For all the universality of the Internet, however, dealing with the multitude of languages across the world is anything but easy. Because computers are machines, they deal with bits and bytes at the fundamental level. For a computer to store text that humans can understand, there needs to be a code that maps each character to a number.

Before Unicode came into being, different countries had their own encoding systems for assigning these numbers to the characters of their respective languages. The problem is that these encoding systems conflict with one another: two encodings can use the same number for two different characters, or different numbers for the same character. To display different languages, a computer needs to support many different encodings, yet whenever data is exchanged between computers it is often far from straightforward for a program to determine which encoding was used for which text.
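To make the conflict concrete, here is a minimal Python sketch. The byte value 0xE9 and the three legacy encodings are picked purely for illustration and are not taken from any particular system; the point is only that the same byte means a different character under each one.

```python
# One byte value, three legacy single-byte encodings, three different characters.
raw = bytes([0xE9])

# Codec names are the ones shipped with Python's standard library.
decoded = {name: raw.decode(name) for name in ("latin-1", "cp1251", "iso8859_7")}

for name, char in decoded.items():
    # Prints a Latin, a Cyrillic and a Greek letter for the very same byte.
    print(f"{name}: {char!r}")
```

Without some out-of-band hint about which encoding was used, a program receiving that byte has no reliable way to choose between the three readings.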

To address this problem, the Unicode Standard was created. It provides the capacity to encode almost all of the characters used for the written languages of the world. To keep character coding simple and efficient, the Unicode Standard assigns each character a unique name and a unique numeric value, called a code point. Code points range from 0 to 10FFFF in hexadecimal, so any one of them fits in 21 bits and can always be stored in a 32-bit number (4 bytes). That leaves room for just over a million characters (1,114,112 code points), which is enough to cover practically all modern and historic scripts of the world, as well as common notational systems.
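As a small illustration of code points and names, the Python sketch below looks up both for a few sample characters. The helper name `describe` and the choice of characters are hypothetical; `ord`, `chr` and `unicodedata.name` are standard Python.

```python
import unicodedata


def describe(char):
    """Return the code point (in U+XXXX notation) and Unicode name of one character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


# A Latin letter, an accented letter, a Han ideograph and an emoji,
# written as escapes so the example does not depend on the file's own encoding.
for char in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    print(describe(char))

# chr() goes the other way, from a code point back to the character.
print(chr(0x4E2D))
```

Every character gets exactly one number, regardless of the language it belongs to or the platform the program runs on.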

In addition, the Unicode Standard defines three encoding forms that specify how a character is to be stored or transmitted. A character can be encoded in a byte-, word- or double-word-oriented format (UTF-8, UTF-16 or UTF-32); the numbers 8, 16 and 32 give the number of bits per code unit. A character therefore does not always take a single byte: it can occupy one, two, three or four bytes, depending on the Unicode Transformation Format (UTF) used and the specific character involved. All three encoding forms encode the same common character definitions and can be efficiently transformed into one another without loss of data. They have the following characteristics:

  • UTF-8: uses only one byte (8 bits) to encode the ASCII characters, which include the unaccented English letters, digits and basic punctuation. Every other character is encoded as a sequence of two to four bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as in ASCII, and that Unicode text transformed into UTF-8 can be used with much existing software without extensive rewrites. UTF-8 is widely used for HTML and on the Internet in general.
  • UTF-16: uses two bytes (16 bits) to encode the most commonly used characters. The remaining characters are represented by a pair of 16-bit numbers, known as a surrogate pair. Modern Windows operating systems use UTF-16 for their internal text representation.
  • UTF-32: uses four bytes (32 bits) to encode the characters. It is capable of representing every Unicode character as one number.
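The differences between the three encoding forms can be seen with a short Python sketch. The helper `encoded_sizes` and the sample characters are hypothetical names chosen for this example; the little-endian codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(text):
    """Return the number of bytes `text` occupies in each Unicode encoding form."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# An ASCII letter, an accented letter, a Han ideograph and an emoji.
samples = ("A", "\u00e9", "\u4e2d", "\U0001F600")

for char in samples:
    print(f"U+{ord(char):04X}", encoded_sizes(char))

# Converting between encoding forms loses nothing: decode one, encode another.
utf8_bytes = "\u4e2d".encode("utf-8")
utf16_bytes = utf8_bytes.decode("utf-8").encode("utf-16-le")
round_tripped = utf16_bytes.decode("utf-16-le")
```

The ASCII letter takes 1, 2 and 4 bytes respectively, while the emoji takes 4 bytes in all three forms: four single-byte units in UTF-8, a surrogate pair in UTF-16, and one 32-bit unit in UTF-32.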

Unicode is truly a revolutionary breakthrough that is fundamentally changing the way we exchange information across the World Wide Web. The magic of Unicode is that it is no longer necessary to use tricks such as GIF images to represent Chinese or Greek text; the whole thing is just plain text, which you can copy and paste like any other text. Computer users who deal with multilingual text will find that the Unicode Standard greatly simplifies their work and makes it easy to exchange multilingual information across the Internet.

Unicode covers languages written in scripts such as Latin, Greek, Hebrew, Arabic, Syriac, Thai, Lao, Han (the ideographs used in Chinese, Japanese and Korean), Hiragana and Katakana, among many others. Mathematicians and technicians, who regularly use mathematical symbols and other technical characters, will also find the Unicode Standard valuable.

Unicode’s success at unifying character sets has led to its widespread and predominant use in the internationalisation and localisation of computer software. The standard has been implemented in many recent technologies, including XML, the Java programming language, the Microsoft .NET Framework, and many modern operating systems. The emergence of the Unicode Standard and the availability of tools supporting it are among the most significant recent global software technology trends.
