36 Unicode and Character Encoding #

Up to now, we have covered together the most important and characteristic concepts, syntax, and programming styles in Go. I am really fond of them and would describe them as treasures.

Before we start today’s content, let me make a simple summary.

Summary of Classic Go Language Knowledge #

The concurrency model based on a hybrid (M:N) threading scheme goes without saying.

In terms of data types, we have:

  • Slices based on underlying arrays;
  • Channels for data passing;
  • Functions as first-class types;
  • Structs that can implement object-oriented programming;
  • Interfaces that can be implemented non-invasively, etc.

In terms of syntax, we have:

  • The asynchronous programming tool go statement;
  • The ultimate checkpoint of a function, the defer statement;
  • The switch statement that allows type checking;
  • The powerful tool for multiple channel operations, the select statement;
  • The very distinctive exception handling functions, panic and recover.

Apart from these, we also discussed the main ways to test Go programs. This involves the built-in testing suite in Go, which includes:

  • Separate source code files for tests;
  • Three types of test functions with different purposes;
  • The dedicated testing package;
  • The powerful go test command.

In addition, not long ago, I also provided an in-depth explanation of the synchronization tools offered by Go. They are an indispensable part of Go’s concurrency programming toolbox. These include:

  • The classic mutex lock;
  • Read-write locks;
  • Condition variables;
  • Atomic operations.

And some data types unique to Go, namely:

  • The helper that guarantees one-time execution, sync.Once;
  • The temporary object pool, sync.Pool;
  • The sync.WaitGroup and context.Context that help us implement cooperative workflows among multiple goroutines;
  • The highly efficient, concurrency-safe map, sync.Map.

To put it mildly, if you truly master the above knowledge, then you have already grasped the essence of Go programming.

Afterwards, when you study the code in the Go standard library and those excellent third-party libraries, you will definitely achieve twice the result with half the effort. Moreover, when you write software using Go, you will surely feel comfortable and at ease.

I have spent a lot of time explaining the most core knowledge points in Go, and I sincerely hope that you have understood these contents.

In the days to come, I will explore with you the most commonly used packages in the Go standard library, understand their usage and mechanisms. Of course, I will also briefly discuss those essential peripheral knowledge.

Introduction 1: Basics of Go Language Character Encoding #

First, let’s focus on the issue of character encoding. This should be a very basic question in the field of computer software.

As I mentioned earlier, identifiers in Go language can contain “any letter character that can be represented by Unicode encoding”. I also mentioned that we can directly convert an integer value to a value of type string.

However, the integer value being converted should represent a valid Unicode code point; otherwise, the result of the conversion will be "�", a string containing only the Unicode replacement character (code point U+FFFD, often displayed as a highlighted question mark).

In addition, when a value of type string is converted to a value of type []rune, the string is split into individual Unicode characters.
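These two conversions can be verified with a short sketch (the sample string and code points below are just illustrations):

```go
package main

import "fmt"

func main() {
	// A valid Unicode code point converts to the character it denotes.
	fmt.Println(string(rune(0x7231))) // "爱"

	// An invalid code point yields the replacement character U+FFFD ("�").
	fmt.Println(string(rune(-1)))

	// Converting a string to []rune splits it into individual Unicode characters.
	fmt.Println([]rune("Go爱")) // [71 111 29233]
}
```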

Obviously, the character encoding scheme used in Go language is based on the Unicode encoding specification. More precisely, Go language code is composed of Unicode characters. All source code in Go language must be encoded in UTF-8 format according to the Unicode encoding specification.

In other words, Go language source code files must be stored in UTF-8 format. If characters that are not valid UTF-8 appear in a source code file, the go command will report an “illegal UTF-8 encoding” error when building, installing, or running it.

Here, we need to first understand the Unicode encoding specification. However, before discussing it, let me briefly introduce ASCII encoding.

Introduction 2: ASCII Encoding #

ASCII stands for “American Standard Code for Information Interchange”. It is a single-byte character encoding scheme developed by the American National Standards Institute (ANSI) and can be used for text-based data exchange.

It started as a national standard of the United States and was later adopted by the International Organization for Standardization (ISO) as the international standard ISO 646, applicable to all Latin-alphabet letters.

The ASCII encoding scheme uses a single byte of binary numbers to encode a character. Standard ASCII encoding uses the highest bit of a byte as a parity bit, while extended ASCII encoding uses this bit to represent characters. The set of printable characters and control characters supported by ASCII encoding is also called the ASCII character set.

The Unicode encoding specification we mentioned is actually another more general character encoding standard for written characters and text. It assigns a unique binary encoding to each character in every existing natural language.

It defines a standardized way for text data in different natural languages to be exchanged internationally and provides an important foundation for globalized software.

The Unicode encoding specification is based on the ASCII character set and goes beyond the limitation of only encoding Latin letters in ASCII. It not only provides the ability to encode over a million characters in the world, but also supports all known escape sequences and control codes.

As we all know, abstract characters in a computer system are encoded as integers. The range of these integers is called the code space. Within the code space, each specific integer is called a code point.

A supported abstract character is mapped and assigned to a specific code point, and conversely, a code point can always be regarded as an encoded character.

The Unicode encoding specification usually uses hexadecimal notation to represent the integer value of a Unicode code point, with “U+” as a prefix. For example, the Unicode code point for the English letter “a” is U+0061. In the Unicode encoding specification, a character can and can only be represented by the corresponding code point.

At the time of writing, the latest released version of the Unicode specification is 11.0, and version 12.0 is scheduled for release in March 2019. Starting from version 1.10, Go has provided comprehensive support for Unicode 10.0. For the majority of application scenarios, this is already sufficient.

The Unicode encoding specification provides three different encoding formats: UTF-8, UTF-16, and UTF-32. The abbreviation “UTF” stands for UCS Transformation Format. UCS can be translated as Universal Character Set, but it can also represent Unicode Character Set. Therefore, UTF can also be translated as Unicode Transformation Format. It represents the way characters are converted into byte sequences.

In the names of these encoding formats, the integer on the right side of “-” represents the number of bits used as an encoding unit. Taking UTF-8 as an example, it uses 8 bits, which is one byte, as an encoding unit. Moreover, it is completely compatible with standard ASCII encoding. This means that in the range [0x00, 0x7F], the characters represented by these two encodings are the same. This is a huge advantage of the UTF-8 encoding format.

UTF-8 is a variable-width encoding scheme. In other words, it uses one to four bytes to represent a character. For example, an English character can be represented by a single byte, while a Chinese character typically requires three bytes. In any case, a supported character can always be encoded by UTF-8 into a byte sequence, which will be referred to below as its UTF-8 encoding value.

Now, after you have a preliminary understanding of this knowledge, please carefully consider and answer the following question. Don’t worry, I will further explain how Unicode, UTF-8, and Go language apply them later.

Question: How is a value of type string expressed at the underlying level?

A typical answer is that at the underlying level, a value of type string is expressed by a series of UTF-8 encoded values of corresponding Unicode code points.

Problem Analysis #

In Go language, a string type value can be split into a sequence of characters or a sequence of bytes.

The former can be represented by a slice with rune as its element type, while the latter can be represented by a slice with byte as its element type.

rune is a basic data type unique to Go. One value of this type represents exactly one character, that is, one Unicode character.

For example, 'G', 'o', '爱', '好', '者' all represent one Unicode character.

We already know that the UTF-8 encoding scheme encodes one Unicode character into a byte sequence with a length ranging from 1 to 4 bytes. Therefore, a value of rune type can also be represented by one or more bytes.

type rune = int32

According to this declaration, rune is actually an alias type of int32. In other words, a value of rune type occupies four bytes of storage, which is always enough to hold any Unicode code point.

At the underlying level, a value of rune type stores the integer value of a Unicode code point. The character itself is the (easier for humans to understand) external representation, while the code point is the (easier for computer systems to process) internal representation.

Please take a look at the following code:

str := "Go爱好者"
fmt.Printf("The string: %q\n", str)
fmt.Printf("  => runes(char): %q\n", []rune(str))
fmt.Printf("  => runes(hex): %x\n", []rune(str))
fmt.Printf("  => bytes(hex): [% x]\n", []byte(str))

If the string value "Go爱好者" is converted into a value of []rune type, then each character (regardless of whether it is an English character or a Chinese character) will become an element value of rune type. Therefore, the second line of output from this code will be shown as follows:

=> runes(char): ['G' 'o' '爱' '好' '者']

Moreover, since each value of rune type is at bottom just an integer, namely a Unicode code point, we can present this character sequence in another way:

=> runes(hex): [47 6f 7231 597d 8005]

As you can see, five hexadecimal numbers correspond to the five characters. The first two, 47 and 6f, are relatively small integers; they are the code points of the characters 'G' and 'o', respectively.

Since these are English characters, their code points fall within the ASCII range, so their UTF-8 encoding values each fit in a single byte whose value equals the code point itself.

On the other hand, the last three hexadecimal numbers, 7231, 597d, and 8005, are relatively large. They are the code points of the Chinese characters '爱', '好', and '者', respectively.

Because these code points lie outside the single-byte range, the UTF-8 encoding value of each of these Chinese characters has to be represented using three bytes.

We can further split and present each character’s UTF-8 encoding value as a byte sequence. The fifth line of code above achieves this. It will produce the following output:

=> bytes(hex): [47 6f e7 88 b1 e5 a5 bd e8 80 85]

The byte slice obtained here is obviously much longer than the character slice mentioned earlier. This is because the UTF-8 encoding value of a Chinese character needs to be represented by three bytes.

The first two element values of the byte slice are the same as those of the character slice. After that, every three element values of the former correspond to one element value of the character slice.

Note that a multi-byte Unicode character has two distinct numeric representations: its code point, which is a single integer, and its UTF-8 encoding value, which is a sequence of bytes that can each be read as a separate integer.

The two representations often look very different. For example, for the Chinese character '爱', the code point is the single hexadecimal integer 7231, while the UTF-8 encoding value consists of the three bytes e7, 88, and b1.

(Figure: the underlying representation of a string value)

In summary, a value of string type consists of multiple Unicode characters, and each Unicode character can be carried by a value of rune type.

These characters are all converted into UTF-8 encoding values at the underlying level, and these UTF-8 encoding values are expressed and stored in the form of byte sequences. Therefore, a value of string type at the underlying level is a byte sequence capable of expressing multiple UTF-8 encoding values.

Knowledge Expansion #

Question 1: What should be noted when using a for statement with a range clause to iterate over string values?

When a for statement with a range clause iterates over a string value, the string is first broken into a byte sequence. The statement then attempts to identify each UTF-8 encoding value, that is, each Unicode character, contained in this byte sequence.

This for statement allows for assigning two iteration variables. If two iteration variables exist, the value assigned to the first variable will be the index corresponding to the first byte of a UTF-8 encoding value within the current byte sequence.

The value assigned to the second variable will be the Unicode character represented by this UTF-8 encoding value, with the type of the variable being rune.

For example, consider the following code:

str := "Go爱好者"
for i, c := range str {
	fmt.Printf("%d: %q [% x]\n", i, c, []byte(string(c)))
}

Here, the string value being iterated is "Go爱好者". During each iteration, this code will print out the values of the two iteration variables and the byte sequence representation of the second value. The complete output is as follows:

0: 'G' [47]
1: 'o' [6f]
2: '爱' [e7 88 b1]
5: '好' [e5 a5 bd]
8: '者' [e8 80 85]

In the first line, the key information is 0, 'G', and [47]. This is because the first Unicode character in this string value is 'G'. This character is a single-byte character and is represented by the first byte in the corresponding byte sequence. The hexadecimal representation of this byte is 47.

The content displayed in the second line is similar: the second Unicode character is 'o', which is represented by the second byte in the byte sequence. Its hexadecimal representation is 6f.

Moving on, the third line displays '爱' as the third Unicode character. Since it is a Chinese character, it is represented by a sequence of three bytes in the byte sequence: e7, 88, and b1. Therefore, its hexadecimal representation is not a single integer but a sequence of these bytes.

It’s important to note that because '爱' is represented by three bytes, the index of the fourth Unicode character '好' is not 3 but 5: the index of '爱' (which is 2) plus the width of its UTF-8 encoding value (which is 3). By the same reasoning, the index of the last character '者' is 8.

From this, we can see that this for statement can iterate over each Unicode character in the string value individually. However, the index values of consecutive Unicode characters may not be continuous. This depends on whether the previous Unicode character is a single-byte character.

Because of this behavior of the for statement, beginners may find it confusing that the values assigned to the two iteration variables do not always correspond. However, once we understand the underlying mechanism, everything becomes clear.
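The same mechanism explains the difference between byte-oriented and character-oriented views of a string; a quick sketch:

```go
package main

import "fmt"

func main() {
	str := "Go爱好者"
	fmt.Println(len(str))         // 11: len counts bytes, not characters
	fmt.Println(len([]rune(str))) // 5: five Unicode characters
	fmt.Printf("%#x\n", str[2])   // 0xe7: indexing a string yields a single byte
}
```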

Summary #

Today, we focused on the Unicode encoding standard, UTF-8 encoding format, and the handling of strings and characters in the Go language.

Go code is made up of Unicode characters, which must be encoded and stored using the UTF-8 encoding format specified in the Unicode encoding standard, otherwise, it will result in an error when running the go command.

The encoding format defined in the Unicode encoding standard specifies how characters are converted to byte sequences. Among them, UTF-8 is a variable-width encoding scheme.

It uses one or more bytes of binary numbers to represent a character, using a maximum of four bytes. A supported character can always be encoded as a byte sequence in UTF-8, which can also be called a UTF-8 encoding value.

A string type value in Go is made up of multiple Unicode characters, each of which can be represented by a rune type value.

These characters are converted to UTF-8 encoding values at the underlying level, and these UTF-8 encoding values are expressed and stored as byte sequences. Therefore, a string type value is essentially a byte sequence that can express multiple UTF-8 encoding values at the underlying level.

Beginners may find the behavior of a for statement with a range clause that iterates over a string value confusing because the values of the two iteration variables do not always seem to correspond. But that’s not the case.

Such a for statement first breaks the iterated string value into a byte sequence, and then tries to identify each UTF-8 encoding value, or each Unicode character, contained in this byte sequence.

The index values of adjacent Unicode characters are not necessarily continuous. This depends on whether the previous Unicode character is a single-byte character. Once we understand these underlying mechanisms, we will no longer be confused.

For Go, the Unicode encoding standard and UTF-8 encoding format are considered fundamentals. We should understand their importance to the Go language. This will be beneficial for correctly understanding related data types in Go and for future program development.

Thought Question #

Today’s thought question is: What are the usual ways to determine whether a Unicode character is a single-byte character?
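As a hint (not the only possible answer), the standard unicode/utf8 package offers a direct way: a character is single-byte in UTF-8 exactly when its code point is below utf8.RuneSelf (0x80). The helper function name below is my own invention:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isSingleByte reports whether r occupies exactly one byte in UTF-8.
// (Equivalently, one could check utf8.RuneLen(r) == 1.)
func isSingleByte(r rune) bool {
	return r < utf8.RuneSelf // utf8.RuneSelf == 0x80
}

func main() {
	fmt.Println(isSingleByte('G')) // true
	fmt.Println(isSingleByte('爱')) // false
}
```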
