Fortunately for people with names like my family name, there is Unicode. No matter what special characters a name contains, they are all part of the Unicode character set, which contains 143,859 characters (as of Unicode 13.0) and is still growing. But since I recently had to deal with this topic in my job, I had to understand some more details. So this is the first of my Unicode-related posts, recapping the basics.
Unicode
The goal of Unicode is a big one:
Unicode covers all the characters for all the writing systems of the world, modern and ancient. It also includes technical symbols, punctuations, and many other characters used in writing text. The Unicode Standard is intended to support the needs of all types of users, whether in business or academia, using mainstream or minority scripts.
Although Unicode allows you to encode almost any text in the world, I will stick to what I know and what I need for my job: the ability to encode any name written in Latin script.
I will try to use and introduce the terminology defined in the Unicode glossary here. The Unicode codespace contains the code points 0..10FFFF₁₆.
To make clear we are referring to Unicode code points, we use the notation U+0000..U+10FFFF from here on.
Not all code points are assigned to encoded characters. There are several code point types:
- Graphic
- Format
- Control
- Private-Use
- Surrogate
- Noncharacter
- Reserved
But for now, we are only interested in the “normal” characters. Let’s have a look at some examples:
code point | name | graphical representation | remarks |
---|---|---|---|
U+0021 | EXCLAMATION MARK | ! | “lowest” graphical character |
U+0031 | DIGIT ONE | 1 | - |
U+0041 | LATIN CAPITAL LETTER A | A | - |
U+00FF | LATIN SMALL LETTER Y WITH DIAERESIS | ÿ | “highest” one byte character |
U+FFFD | REPLACEMENT CHARACTER | � | “highest” graphical character with two bytes |
U+1F9DF | ZOMBIE | 🧟 | definitely a character! |
U+1F9E6 | SOCKS | 🧦 | did you know these are a character? |
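To look up code points like the ones in the table from Java, a minimal sketch could look like the following. It relies on `Character.getName` and `Character.getType` from the standard library; the class name `CodePointInfo` and the helpers `notation` and `describe` are made up for this post. Which names come back for newer characters such as the emoji depends on the Unicode version bundled with your JDK, so treat the output for those as an assumption.

```java
class CodePointInfo {

    // Formats a code point in the usual U+XXXX notation (at least four hex digits).
    static String notation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    // Hypothetical helper: the notation plus the name the JDK knows for the code point.
    // Character.getName returns null for unassigned code points.
    static String describe(int codePoint) {
        return notation(codePoint) + " " + Character.getName(codePoint);
    }

    public static void main(String[] args) {
        int[] codePoints = {0x21, 0x31, 0x41, 0xFF, 0xFFFD, 0x1F9DF, 0x1F9E6};
        for (int cp : codePoints) {
            // getType returns a general category constant, e.g. Character.UPPERCASE_LETTER
            System.out.println(describe(cp) + " (type " + Character.getType(cp) + ")");
        }
        // The code space ends at U+10FFFF
        System.out.println(notation(Character.MAX_CODE_POINT));
    }
}
```

Passing code points around as `int` is deliberate here: a Java `char` is too small to hold values above U+FFFF.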
Encodings
The Unicode standard defines several encodings for Unicode code points, namely UTF-8, UTF-16 and UTF-32. The standard and those encodings are related to ASCII and ISO/IEC 8859-1: the first 128 characters are the same in all of them and have the same numerical value. For UTF-8, this means that even the bytes of those 128 characters are the same as in ASCII and ISO/IEC 8859-1.
encoding | code unit | code unit per code point / character | remarks |
---|---|---|---|
UTF-8 | 8 bit | 1..4 | the first 128 code points (the ASCII range) need only 1 code unit (= 1 byte) |
UTF-16 | 16 bit | 1..2 | exists for legacy reasons, 1 code unit for code points up to U+FFFF, 2 code units above that |
UTF-32 | 32 bit | 1 | numerical value always the same as code point, actually only 21 bits needed, many zeroes |
Let’s look at our characters from above in the different encodings:
character | code point | UTF-32 | UTF-16 | UTF-8 | ASCII | ISO/IEC 8859-1 |
---|---|---|---|---|---|---|
! | U+0021 | 000021 | 0021 | 21 | 21 | 21 |
1 | U+0031 | 000031 | 0031 | 31 | 31 | 31 |
A | U+0041 | 000041 | 0041 | 41 | 41 | 41 |
ÿ | U+00FF | 0000ff | 00ff | c3 bf | - | ff |
� | U+FFFD | 00fffd | fffd | ef bf bd | - | - |
🧟 | U+1F9DF | 01f9df | d83e dddf | f0 9f a7 9f | - | - |
🧦 | U+1F9E6 | 01f9e6 | d83e dde6 | f0 9f a7 a6 | - | - |
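The byte sequences in the table can be reproduced with the charsets from `java.nio.charset.StandardCharsets`. The following is only a sketch: the class `EncodingDemo` and the helper `hex` are names invented for this post. It uses `UTF_16BE` instead of `UTF_16`, because the latter prepends a byte order mark when encoding. Note that the output is printed byte by byte, so the UTF-16 code unit `d83e` shows up as `d8 3e`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {

    // Hypothetical helper: encodes the text and renders the bytes as lowercase hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255, because Java bytes are signed
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Build the strings from escapes and code points,
        // so the encoding of the source file does not matter
        String[] samples = {
            "!", "A", "\u00FF", "\uFFFD", new String(Character.toChars(0x1F9DF))
        };
        for (String s : samples) {
            System.out.println("UTF-8:    " + hex(s, StandardCharsets.UTF_8));
            System.out.println("UTF-16BE: " + hex(s, StandardCharsets.UTF_16BE));
        }
    }
}
```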
Java
Java has supported Unicode from the beginning. Java source files can contain any Unicode character and can use any encoding that the Java compiler supports. Most of the time, you will find source files encoded in UTF-8. Internally, Java uses UTF-16 to represent text.
A char in Java (java.lang.Character) has a value range of 0..ffff₁₆! This means it is actually only a UTF-16 code unit and not a character!
The reason is that Java supported Unicode from the beginning, and back then:
Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16-bits were not sufficient for the user community. Out of this arose UTF-16
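To make the consequence for Java concrete: a code point above U+FFFF does not fit into one `char`, so UTF-16 splits it into two code units, a high and a low surrogate. The sketch below does the arithmetic by hand and compares it with `Character.toChars` from the standard library. The class `SurrogateDemo` and the method `toSurrogates` are names made up for illustration, and the method assumes its input really is above U+FFFF.

```java
import java.util.Arrays;

class SurrogateDemo {

    // Splits a supplementary code point (above U+FFFF) into its two UTF-16 code units:
    // subtract 0x10000, then spread the remaining 20 bits over two 10-bit halves.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;
        char high = (char) (0xD800 + (offset >> 10));   // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // lower 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        int zombie = 0x1F9DF;
        char[] manual = toSurrogates(zombie);
        char[] builtIn = Character.toChars(zombie);
        // prints "d83e dddf", matching the UTF-16 column above
        System.out.printf("%04x %04x%n", (int) manual[0], (int) manual[1]);
        // prints "true": the manual split agrees with the standard library
        System.out.println(Arrays.equals(manual, builtIn));
        // prints "true": the first char is a high surrogate, not a character of its own
        System.out.println(Character.isHighSurrogate(builtIn[0]));
    }
}
```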
So a CharSequence in Java actually has two kinds of length: the char length and the Unicode code point length:
String | char length | code point length |
---|---|---|
“!” | 1 | 1 |
“1” | 1 | 1 |
“A” | 1 | 1 |
“ÿ” | 1 | 1 |
“�” | 1 | 1 |
“🧟” | 2 | 1 |
“🧦” | 2 | 1 |
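The two lengths from the table can be checked with `String.length()` and `String.codePointCount`. A small sketch, where the class `LengthDemo` and its two helper methods are names I picked for this example:

```java
class LengthDemo {

    // Number of UTF-16 code units, which is what String.length() reports
    static int charLength(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String socks = new String(Character.toChars(0x1F9E6));
        String[] samples = {"!", "A", "\u00FF", socks};
        for (String s : samples) {
            System.out.println(charLength(s) + " chars, " + codePointLength(s) + " code points");
        }
    }
}
```

For the socks, this reports 2 chars but only 1 code point.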
What this means in detail, I will write about in my next post.