Unicode primer

Characters, code points and encodings

Fortunately for people with names like my family name, there is Unicode. No matter what special characters a name contains, they are all part of the Unicode character set, which contains 143,859 characters and is still growing. Since I recently had to deal with this topic at work, I had to understand some more of the details. So this is the first of my Unicode-related posts, recapping the basics.

Unicode

The goal of Unicode is a big one:

Unicode covers all the characters for all the writing systems of the world, modern and ancient. It also includes technical symbols, punctuations, and many other characters used in writing text. The Unicode Standard is intended to support the needs of all types of users, whether in business or academia, using mainstream or minority scripts.

Although Unicode allows you to encode almost any text in the world, I will stick to what I know and what I need for my job: the ability to encode any name written in Latin script.

I will try to use and introduce the terminology defined in the Unicode glossary here. The Unicode code space contains the code points 0..10FFFF₁₆.

To make clear we are referring to Unicode code points, we use the notation U+0000..U+10FFFF from here on.

Not all code points are assigned to encoded characters. There are several code point types:

  • Graphic
  • Format
  • Control
  • Private-Use
  • Surrogate
  • Noncharacter
  • Reserved
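
To get a feeling for these types, here is a minimal Java sketch that maps a code point to one of the seven names above. The class and method names are my own, and the mapping from Java's general categories (Character.getType) to the code point types is my reading of the standard, so treat it as an approximation. In particular, whether a code point counts as "Reserved" depends on the Unicode version your Java runtime ships with.

```java
final class CodePointTypes {

    /** Maps a code point to one of the seven type names from the list above. */
    static String typeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("outside U+0000..U+10FFFF");
        }
        switch (Character.getType(codePoint)) {
            case Character.CONTROL:
                return "Control";
            case Character.FORMAT:
            case Character.LINE_SEPARATOR:
            case Character.PARAGRAPH_SEPARATOR:
                return "Format";
            case Character.PRIVATE_USE:
                return "Private-Use";
            case Character.SURROGATE:
                return "Surrogate";
            case Character.UNASSIGNED:
                // Unassigned code points are either noncharacters or reserved.
                return isNoncharacter(codePoint) ? "Noncharacter" : "Reserved";
            default:
                // Letters, marks, numbers, punctuation, symbols, spaces.
                return "Graphic";
        }
    }

    /** U+FDD0..U+FDEF plus the last two code points of every plane. */
    static boolean isNoncharacter(int codePoint) {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
                || (codePoint & 0xFFFE) == 0xFFFE;
    }

    public static void main(String[] args) {
        int[] samples = {0x0041, 0x000A, 0x200D, 0xE000, 0xD83E, 0xFFFF};
        for (int codePoint : samples) {
            System.out.printf("U+%04X %s%n", codePoint, typeOf(codePoint));
        }
    }
}
```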

But for now, we are only interested in the “normal” characters. Let's have a look at some examples:

code point | name | graphical representation | remarks
U+0021 | EXCLAMATION MARK | ! | “lowest” graphical character
U+0031 | DIGIT ONE | 1 | -
U+0041 | LATIN CAPITAL LETTER A | A | -
U+00FF | LATIN SMALL LETTER Y WITH DIAERESIS | ÿ | “highest” one byte character
U+FFFD | REPLACEMENT CHARACTER | � | “highest” graphical character with two bytes
U+1F9DF | ZOMBIE | 🧟 | definitely a character!
U+1F9E6 | SOCKS | 🧦 | did you know these are a character?
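
Rows like these can be looked up from Java itself. The following is a small sketch, with class and method names of my own choosing, that uses Character.getName to print the U+ notation, the character name and the character. I am assuming a runtime whose Unicode data already knows the emoji (Java 11 or newer should); on older versions the last two code points may not have a name.

```java
final class CodePointNames {

    /** Formats a code point as "U+XXXX NAME glyph". */
    static String describe(int codePoint) {
        // toChars yields one char, or a surrogate pair above U+FFFF.
        String glyph = new String(Character.toChars(codePoint));
        return String.format("U+%04X %s %s",
                codePoint, Character.getName(codePoint), glyph);
    }

    public static void main(String[] args) {
        int[] examples = {0x0021, 0x0031, 0x0041, 0x00FF, 0xFFFD, 0x1F9DF, 0x1F9E6};
        for (int codePoint : examples) {
            System.out.println(describe(codePoint));
        }
    }
}
```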

Encodings

The Unicode standard defines several encodings for Unicode code points, namely UTF-8, UTF-16 and UTF-32. The Unicode standard and those encodings are related to ASCII and ISO/IEC 8859-1: the first 128 characters are the same in all of them and have the same numerical value. For UTF-8 this means that even the bytes of these 128 characters are the same as in ASCII and ISO/IEC 8859-1.

encoding | code unit | code units per code point | remarks
UTF-8 | 8 bit | 1..4 | the 128 ASCII characters only need 1 code unit (= 1 byte)
UTF-16 | 16 bit | 1..2 | legacy reasons, 1 code unit for the code points up to U+FFFF (the 64k most common characters)
UTF-32 | 32 bit | 1 | numerical value always the same as the code point, actually only 21 bits needed, many zeroes

Let’s look at our characters from above in the different encodings:

character | code point | UTF-32 | UTF-16 | UTF-8 | ASCII | ISO/IEC 8859-1
! | U+0021 | 00000021 | 0021 | 21 | 21 | 21
1 | U+0031 | 00000031 | 0031 | 31 | 31 | 31
A | U+0041 | 00000041 | 0041 | 41 | 41 | 41
ÿ | U+00FF | 000000ff | 00ff | c3 bf | - | ff
� | U+FFFD | 0000fffd | fffd | ef bf bd | - | -
🧟 | U+1F9DF | 0001f9df | d83e dddf | f0 9f a7 9f | - | -
🧦 | U+1F9E6 | 0001f9e6 | d83e dde6 | f0 9f a7 a6 | - | -
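
If you want to check such values yourself, here is a hedged Java sketch that encodes a string with a given charset and prints the bytes in hex. The class and method names are made up for this post. Two assumptions: I use the big-endian variants, so the bytes appear in the same order as the code units, and I rely on "UTF-32BE" being available, which OpenJDK provides but the Java specification does not guarantee. The characters are written as \u escapes so the example does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingTable {

    /** Returns the encoded bytes as hex, or "-" if the charset cannot encode the text. */
    static String hex(String text, Charset charset) {
        if (!charset.newEncoder().canEncode(text)) {
            return "-";
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to avoid sign extension of bytes >= 0x80.
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: UTF-32BE is an optional charset, present in OpenJDK.
        Charset utf32 = Charset.forName("UTF-32BE");
        String[] samples = {"!", "A", "\u00FF", "\uFFFD", "\uD83E\uDDDF"};
        for (String s : samples) {
            System.out.printf("%s | %s | %s | %s | %s%n",
                    hex(s, utf32),
                    hex(s, StandardCharsets.UTF_16BE),
                    hex(s, StandardCharsets.UTF_8),
                    hex(s, StandardCharsets.US_ASCII),
                    hex(s, StandardCharsets.ISO_8859_1));
        }
    }
}
```

Note that a plain getBytes call with ASCII would silently replace ÿ with a question mark, which is why the sketch asks the encoder first.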

Java

Java has supported Unicode from the beginning. Java source files can contain any Unicode character and can use any encoding the Java compiler supports. Most of the time, you will find source files encoded in UTF-8. Internally, Java uses UTF-16 to represent text.

A char in Java (java.lang.Character) has the value range 0..FFFF₁₆! This means it is actually only a UTF-16 code unit and not a character!
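
A quick way to see this is to ask Java how many chars a code point needs. The sketch below (the names are mine, not from any library) splits a code point into its UTF-16 code units with Character.toChars and puts a surrogate pair back together with Character.toCodePoint.

```java
final class CodeUnits {

    /** Returns the UTF-16 code units of a code point as hex, e.g. "d83e dddf". */
    static String codeUnits(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int zombie = 0x1F9DF;
        // A char holds 16 bits, so the zombie needs two of them.
        System.out.println(Character.charCount(zombie));
        System.out.println(codeUnits(zombie));

        char[] pair = Character.toChars(zombie);
        System.out.println(Character.isHighSurrogate(pair[0]));
        System.out.println(Character.isLowSurrogate(pair[1]));
        // Combining the pair gives back the original code point.
        System.out.printf("U+%04X%n", Character.toCodePoint(pair[0], pair[1]));
    }
}
```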

The reason for this is that Java supported Unicode from the beginning and:

Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16

So a CharSequence in Java actually has two kinds of length: the char length and the Unicode code point length:

String | char length | code point length
“!” | 1 | 1
“1” | 1 | 1
“A” | 1 | 1
“ÿ” | 1 | 1
“�” | 1 | 1
“🧟” | 2 | 1
“🧦” | 2 | 1
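
Both lengths can be read off directly in Java. This is a minimal sketch with a helper name of my own: length() counts chars, that is UTF-16 code units, while Character.codePointCount counts code points.

```java
final class Lengths {

    /** Number of Unicode code points in the text, as opposed to text.length(). */
    static int codePointLength(CharSequence text) {
        return Character.codePointCount(text, 0, text.length());
    }

    public static void main(String[] args) {
        // Written as escapes so the source file encoding does not matter.
        String[] samples = {"!", "A", "\u00FF", "\uFFFD", "\uD83E\uDDDF", "\uD83E\uDDE6"};
        for (String s : samples) {
            System.out.printf("%s | %d | %d%n", s, s.length(), codePointLength(s));
        }
    }
}
```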

What this means in detail, I will write about in my next post.

Standards