Unicode primer

Fortunately for people with names like my family name, there is Unicode. No matter what special characters a name contains, they are all part of the Unicode character set, which contains 143,859 characters and is still growing. Because I recently had to deal with this topic at work, I needed to understand some of the details. This is the first of my Unicode-related posts, recapping the basics.

Unicode

The goal of Unicode is a big one:

Unicode covers all the characters for all the writing systems of the world, modern and ancient. It also includes technical symbols, punctuations, and many other characters used in writing text. The Unicode Standard is intended to support the needs of all types of users, whether in business or academia, using mainstream or minority scripts.

Although Unicode lets you encode almost any text in the world, I will stick to what I know and what I need for my job: the ability to encode any name written in Latin script.

I will try to use and introduce the terminology defined in the Unicode glossary here. The Unicode codespace contains the code points 0..10FFFF₁₆.

To make clear we are referring to Unicode code points, we use the notation U+0000..U+10FFFF from here on.

Not all code points are assigned to encoded characters. There are several code point types:

Graphic
Format
Control
Private-Use
Surrogate
Noncharacter
Reserved
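
To get a feeling for these types, here is a minimal Java sketch that maps a code point to one of the seven names above. The class name CodePointTypes and the method typeOf are made up for this post. The mapping from the general categories returned by Character.getType to the glossary types is my own approximation, and which code points count as Reserved depends on the Unicode version your JDK ships with.

```java
class CodePointTypes {

    /** Maps a code point to one of the seven code point types (approximation). */
    static String typeOf(int codePoint) {
        // Noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane.
        if ((codePoint >= 0xFDD0 && codePoint <= 0xFDEF) || (codePoint & 0xFFFE) == 0xFFFE) {
            return "Noncharacter";
        }
        switch (Character.getType(codePoint)) {
            case Character.CONTROL:
                return "Control";
            case Character.FORMAT:
            case Character.LINE_SEPARATOR:
            case Character.PARAGRAPH_SEPARATOR:
                return "Format";
            case Character.PRIVATE_USE:
                return "Private-Use";
            case Character.SURROGATE:
                return "Surrogate";
            case Character.UNASSIGNED:
                // Unassigned in the Unicode version known to this JDK.
                return "Reserved";
            default:
                // Letters, marks, numbers, punctuation, symbols and spaces.
                return "Graphic";
        }
    }

    public static void main(String[] args) {
        // U+0378 is assumed to be unassigned in the JDK's Unicode version.
        int[] samples = {0x0041, 0x000A, 0x200D, 0xE000, 0xD800, 0xFFFF, 0x0378};
        for (int cp : samples) {
            System.out.printf("U+%04X %s%n", cp, typeOf(cp));
        }
    }
}
```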

But for now, we are only interested in the “normal” characters. Let’s have a look at some examples:

code point	name	graphical representation	remarks
U+0021	EXCLAMATION MARK	!	“lowest” graphical character
U+0031	DIGIT ONE	1	-
U+0041	LATIN CAPITAL LETTER A	A	-
U+00FF	LATIN SMALL LETTER Y WITH DIAERESIS	ÿ	“highest” character that fits in one byte (ISO/IEC 8859-1)
U+FFFD	REPLACEMENT CHARACTER	�	“highest” graphical character that fits in two bytes (one UTF-16 code unit)
U+1F9DF	ZOMBIE	🧟	definitely a character!
U+1F9E6	SOCKS	🧦	did you know these are a character?

Encodings

The Unicode standard defines several encodings for Unicode code points, namely UTF-8, UTF-16 and UTF-32. Unicode and these encodings are related to ASCII and ISO/IEC 8859-1: the first 128 characters are the same in all of them and have the same numerical value. For UTF-8 this means that even the bytes for these 128 characters are the same as in ASCII and ISO/IEC 8859-1.

encoding	code unit	code units per code point / character	remarks
UTF-8	8 bit	1..4	the first 128 code points (the ASCII characters) only need 1 code unit (= 1 byte)
UTF-16	16 bit	1..2	exists for legacy reasons, 1 code unit for the roughly 64k code points of the Basic Multilingual Plane (U+0000..U+FFFF)
UTF-32	32 bit	1	numerical value always the same as the code point, only 21 bits actually needed, so many leading zeroes
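
The code unit counts in this table follow directly from the code point ranges. A small sketch, assuming valid code points as input; the class name CodeUnitCounts and its methods are made up for illustration:

```java
class CodeUnitCounts {

    /** Number of 8-bit code units (bytes) a code point needs in UTF-8. */
    static int utf8Units(int codePoint) {
        if (codePoint < 0x80) return 1;      // U+0000..U+007F (ASCII)
        if (codePoint < 0x800) return 2;     // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;   // U+0800..U+FFFF
        return 4;                            // U+10000..U+10FFFF
    }

    /** Number of 16-bit code units in UTF-16, as reported by the JDK. */
    static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    /** UTF-32 always uses exactly one 32-bit code unit. */
    static int utf32Units(int codePoint) {
        return 1;
    }

    public static void main(String[] args) {
        int[] samples = {0x0021, 0x00FF, 0xFFFD, 0x1F9DF};
        for (int cp : samples) {
            System.out.printf("U+%04X UTF-8: %d, UTF-16: %d, UTF-32: %d%n",
                    cp, utf8Units(cp), utf16Units(cp), utf32Units(cp));
        }
    }
}
```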

Let’s look at our characters from above in the different encodings:

	code point	UTF-32	UTF-16	UTF-8	ASCII	ISO/IEC 8859-1
!	U+0021	00000021	0021	21	21	21
1	U+0031	00000031	0031	31	31	31
A	U+0041	00000041	0041	41	41	41
ÿ	U+00FF	000000ff	00ff	c3 bf	-	ff
�	U+FFFD	0000fffd	fffd	ef bf bd	-	-
🧟	U+1F9DF	0001f9df	d83e dddf	f0 9f a7 9f	-	-
🧦	U+1F9E6	0001f9e6	d83e dde6	f0 9f a7 a6	-	-
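
If you want to check such values yourself, a sketch like the following prints the bytes of a single code point in a given charset. EncodingDump and hex are names I made up; the charsets come from java.nio.charset.StandardCharsets. Note that the output is byte by byte, so a UTF-16 code unit shows up as two bytes (big-endian here, without a byte order mark).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDump {

    /** Encodes one code point in the given charset and returns the bytes as hex. */
    static String hex(int codePoint, Charset charset) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(charset);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so that bytes above 0x7f are not printed as negative numbers.
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(0x00FF, StandardCharsets.ISO_8859_1)); // ff
        System.out.println(hex(0x00FF, StandardCharsets.UTF_8));      // c3 bf
        System.out.println(hex(0xFFFD, StandardCharsets.UTF_16BE));   // ff fd
        System.out.println(hex(0x1F9DF, StandardCharsets.UTF_16BE));  // d8 3e dd df
        System.out.println(hex(0x1F9DF, StandardCharsets.UTF_8));     // f0 9f a7 9f
    }
}
```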

Java

Java has supported Unicode from the beginning. Java source files can contain any Unicode character and can use any encoding that the Java compiler supports. Most of the time, you will find source files encoded in UTF-8. Internally, Java uses UTF-16 to represent text.

A char in Java (java.lang.Character) has the value range 0..FFFF₁₆! This means it is actually only a UTF-16 code unit and not a character!
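
A short sketch of what that means in practice: a code point above U+FFFF needs two chars, a high and a low surrogate, and neither of them is a character on its own. CharIsCodeUnit and codeUnits are illustrative names; Character.toChars, Character.isHighSurrogate, Character.isLowSurrogate and Character.toCodePoint are part of the JDK.

```java
class CharIsCodeUnit {

    /** Returns the UTF-16 code units (Java chars) of one code point as hex. */
    static String codeUnits(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(0x00FF));  // 00ff
        System.out.println(codeUnits(0x1F9DF)); // d83e dddf

        // The two chars of the zombie are surrogates, not characters.
        char[] zombie = Character.toChars(0x1F9DF);
        System.out.println(Character.isHighSurrogate(zombie[0])); // true
        System.out.println(Character.isLowSurrogate(zombie[1]));  // true

        // Only together do they form the code point again.
        int codePoint = Character.toCodePoint(zombie[0], zombie[1]);
        System.out.printf("U+%04X%n", codePoint); // U+1F9DF
    }
}
```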

The reason for this is that Java supported Unicode from the beginning and:

Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16

So a CharSequence in Java actually has two kinds of length: the char length and the Unicode code point length:

String	char length	code point length
“!”	1	1
“1”	1	1
“A”	1	1
“ÿ”	1	1
“�”	1	1
“🧟”	2	1
“🧦”	2	1
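
Both lengths are easy to get in Java. A minimal sketch (TwoLengths and its methods are made-up names; the emoji is written with \u escapes so the example does not depend on the source file encoding):

```java
class TwoLengths {

    /** Number of UTF-16 code units, which is what length() reports. */
    static int charLength(CharSequence text) {
        return text.length();
    }

    /** Number of Unicode code points in the text. */
    static int codePointLength(CharSequence text) {
        return Character.codePointCount(text, 0, text.length());
    }

    public static void main(String[] args) {
        String zombie = "\uD83E\uDDDF"; // U+1F9DF as a surrogate pair
        System.out.println(charLength("A"));         // 1
        System.out.println(codePointLength("A"));    // 1
        System.out.println(charLength(zombie));      // 2
        System.out.println(codePointLength(zombie)); // 1
    }
}
```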

I will write about what this means in detail in my next post.

Standards