UTF-8 Encode


UTF-8 brief

Computers read bytes and people read characters, so we use encoding standards to map characters to bytes. ASCII was the first widely used standard, but covers only Latin (7 bits/character can represent 128 different characters). Unicode is a standard with the goal to cover all possible characters in the world (can hold up to 1,114,112 characters, meaning 21 bits/character max. Current Unicode 8.0 specifies 120,737 characters in total, and that's all).

The main difference is that an ASCII character can fit to a byte (8 bits), but most Unicode characters cannot. So encoding forms/schemes (like UTF-8 and UTF-16) are used, and the character model goes like this:

Every character holds an enumerated position from 0 to 1,114,111 (hex: 0-10FFFF) called code point.


Why do we need Unicode?

In the (not too) early days, all that existed was ASCII. This was okay, as all that would ever be needed were a few control characters, punctuation, numbers and letters like the ones in this sentence. Unfortunately, today's strange world of global intercommunication and social media was not foreseen, and it is not too unusual to see English, العربية, 汉语, עִבְרִית, ελληνικά, and ភាសាខ្មែរ in the same document (I hope I didn't break any old browsers).

But for argument's sake, lets say Joe Average is a software developer. He insists that he will only ever need English, and as such only wants to use ASCII. This might be fine for Joe the user, but this is not fine for Joe the software developer. Approximately half the world uses non-Latin characters and using ASCII is arguably inconsiderate to these people, and on top of that, he is closing off his software to a large and growing economy.

Therefore, an encompassing character set including all languages is needed. Thus came Unicode. It assigns every character a unique number called a code point. One advantage of Unicode over other possible sets is that the first 256 code points are identical to ISO-8859-1, and hence also ASCII. In addition, the vast majority of commonly used characters are representable by only two bytes, in a region called the Basic Multilingual Plane (BMP). Now a character encoding is needed to access this character set, and as the question asks, I will concentrate on UTF-8 and UTF-16.


External Links
  • More information about UTF-8 (Wikipedia)