37 strings Package and String Operations #

In the previous article, I introduced the Unicode encoding specification, the origin of the UTF-8 encoding format, and how the Go language applies them.

The Go language not only has the rune type, which can represent a single Unicode character on its own, but also the for statement with a range clause, which can split a string value into Unicode characters.

In addition, the unicode package and its sub-packages in the standard library provide many functions and data types that can help us parse Unicode characters in various contents.

These program entities are very useful, simple and clear, and effectively hide some of the complex details of the Unicode encoding specification. I won’t go into the details here.

Today, we are mainly going to talk about the strings code package in the standard library. This code package also uses many program entities from the unicode package and the unicode/utf8 package.

  • For example, the WriteRune method of the strings.Builder type.

  • Another example is the ReadRune method of the strings.Reader type, and so on.

The following question is specifically about the strings.Builder type. Our question for today is: What are the advantages of the strings.Builder type compared to a string value?

The typical answer to this question is as follows:

Compared with a string value, a value of the strings.Builder type (hereinafter referred to as a Builder value) has the following advantages:

  • Existing content is immutable, but more content can be appended;
  • Reduces the number of memory allocations and content copies;
  • Content can be reset and the value can be reused.

Problem Analysis #

Let’s talk about the string type first. As we all know, in the Go language, values of type string are immutable. If we want to obtain a different string, we can only perform operations such as trimming and concatenation based on the original string to generate a new string.

  • Trimming can be done using slice expressions;
  • Concatenation can be achieved using the + operator.
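A minimal sketch of both operations follows. The function name trimAndJoin is hypothetical and exists only for this illustration.

```go
package main

import "fmt"

// trimAndJoin drops the first n bytes of s and appends suffix with +.
func trimAndJoin(s string, n int, suffix string) string {
	trimmed := s[n:]        // slice expression: a new string value, s is unchanged
	return trimmed + suffix // concatenation: yet another new string value
}

func main() {
	s := "Hello, Go"
	fmt.Println(trimAndJoin(s, 7, "pher")) // Gopher
	fmt.Println(s)                         // Hello, Go
}
```

Note that the original string is still intact after both operations; each step only produces a new value.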

Under the hood, the content of a string value is stored in a contiguous block of memory. At the same time, the number of bytes held by this block of memory is also recorded and used to represent the length of the string value.

You can think of the content of this block of memory as a byte array, and the corresponding string value contains a pointer value pointing to the head of the byte array. In this way, when we apply slice expressions to a string value, it is equivalent to slicing the underlying byte array.

In addition, when we concatenate strings, Go copies all of the operands into a completely new, sufficiently large contiguous block of memory and returns a string value that holds the pointer to that block.

Obviously, when there are too many string concatenation operations in a program, it will put a great deal of pressure on memory allocation.

Note that although a string value holds a pointer value internally, string is still a value type. However, because string values are immutable, the pointer value inside them also helps save memory.

Specifically, a string value shares the same byte array with all its copies. Since this byte array is never modified, doing so is absolutely safe.
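One way to observe this sharing is to compare the data pointers directly. The sketch below assumes Go 1.20 or later, because it relies on unsafe.StringData; the helper names sameStart, startsInside, and demo are invented for the example. It is meant for exploration only, not as something to do in production code.

```go
package main

import (
	"fmt"
	"strings"
	"unsafe"
)

// sameStart reports whether a and b begin at the same memory address.
func sameStart(a, b string) bool {
	return unsafe.StringData(a) == unsafe.StringData(b)
}

// startsInside reports whether sub begins exactly off bytes into s.
func startsInside(s, sub string, off int) bool {
	base := unsafe.Pointer(unsafe.StringData(s))
	return unsafe.Pointer(unsafe.StringData(sub)) == unsafe.Add(base, off)
}

// demo builds a string at run time, then copies and slices it.
func demo() (copyShares, sliceShares bool) {
	s1 := strings.Repeat("ab", 8)
	s2 := s1        // copies only the pointer and the length
	sub := s1[4:10] // points into the same bytes, 4 bytes further on
	return sameStart(s1, s2), startsInside(s1, sub, 4)
}

func main() {
	fmt.Println(demo()) // true true
}
```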

Compared with string values, the advantage of Builder values is mainly reflected in string concatenation.
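To get a rough feel for the difference, the following sketch counts allocations for both approaches with testing.AllocsPerRun from the standard library. The function names and the sink variable are assumptions made for this example, and the exact numbers depend on the Go version and the input size, so treat the output as indicative only.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

var sink string // keeps results alive so the work is not optimized away

// concatPlus joins n copies of part with the + operator.
func concatPlus(part string, n int) string {
	s := ""
	for i := 0; i < n; i++ {
		s += part // each step builds a brand-new string
	}
	return s
}

// concatBuilder joins n copies of part with a strings.Builder.
func concatBuilder(part string, n int) string {
	var b strings.Builder
	for i := 0; i < n; i++ {
		b.WriteString(part)
	}
	return b.String()
}

// allocs returns the average number of allocations per call for both approaches.
func allocs(part string, n int) (plus, builder float64) {
	plus = testing.AllocsPerRun(100, func() { sink = concatPlus(part, n) })
	builder = testing.AllocsPerRun(100, func() { sink = concatBuilder(part, n) })
	return plus, builder
}

func main() {
	plus, builder := allocs("0123456789", 200)
	fmt.Printf("+ operator: %.0f allocations, Builder: %.0f allocations\n", plus, builder)
}
```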

The Builder value has a container (hereafter referred to as the content container) used to hold content. It is a slice of byte elements (hereafter referred to as the byte slice).

Since the underlying array of such a byte slice is a byte array, we can say that it is similar to the way string values store content.

In fact, in the Go runtime's internal representation, both a string value and a slice value use an unsafe.Pointer field to hold the pointer to the underlying byte array.

It is precisely because of this internal structure that Builder values also have the premise of efficiently using memory. Although any element value contained in the byte slice itself can be modified, the Builder value does not allow this. Its content can only be concatenated or completely reset.

This means that the content already present in the Builder value is immutable. Therefore, we can use the methods provided by the Builder value to concatenate more content without worrying about these methods affecting the existing content.

The methods mentioned here refer to a series of pointer methods owned by the Builder value, including: Write, WriteByte, WriteRune, and WriteString. We can collectively call them concatenation methods.
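The four concatenation methods can be tried out together as in the sketch below. The function name build is made up for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// build uses each of the four concatenation methods once.
func build() string {
	var b strings.Builder
	b.Write([]byte("Go"))     // append a byte slice
	b.WriteByte(' ')          // append a single byte
	b.WriteRune('语')          // append one Unicode character (3 bytes in UTF-8)
	b.WriteString("言 rocks") // append a string
	return b.String()
}

func main() {
	s := build()
	fmt.Println(s)      // Go 语言 rocks
	fmt.Println(len(s)) // 15
}
```

All four methods have pointer receivers, which is why they can be called on the addressable variable b directly.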

We can use these methods to append new content to the end (i.e., the right side) of the existing content in the Builder value. If necessary, the Builder value will automatically resize its content container. The automatic resizing strategy here is the same as that of slices, because the content container is itself a byte slice that grows through append.

In other words, when we append content to the Builder value, it does not necessarily cause resizing. As long as the capacity of the content container is sufficient, resizing will not occur, and the corresponding memory allocation will not happen. At the same time, as long as there is no resizing, the existing content in the Builder value will not be copied again.
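The Cap method of a Builder value (available since Go 1.12) makes this visible. In the sketch below, room is reserved up front and the capacity is compared before and after a series of appends that fit; the function name capsAround is an assumption of this example.

```go
package main

import (
	"fmt"
	"strings"
)

// capsAround reserves room, appends data that fits into it,
// and returns the capacity before and after the appends.
func capsAround() (before, after int) {
	var b strings.Builder
	b.Grow(64) // reserve room for at least 64 bytes
	before = b.Cap()
	for i := 0; i < 8; i++ {
		b.WriteString("12345678") // 8 * 8 = 64 bytes in total
	}
	after = b.Cap()
	return before, after
}

func main() {
	before, after := capsAround()
	fmt.Println(before == after) // true: no resizing took place
}
```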

In addition to the automatic resizing of the Builder value, we can also resize manually by calling the Grow method of the Builder value. The Grow method can also be called the resizing method. It takes a parameter n of type int, which represents the number of additional bytes to make room for. If n is negative, Grow panics.

If necessary, the Grow method will enlarge the content container so that at least n more bytes can be written without another allocation. Specifically, it will create a byte slice as the new content container, whose capacity is at least twice the original capacity plus n. After that, it will copy all the bytes in the original container into the new container.

var builder1 strings.Builder
// Omitted some code.
fmt.Println("Grow the builder ...")
builder1.Grow(10)
fmt.Printf("The length of contents in the builder is %d.\n", builder1.Len())

Of course, the Grow method may do nothing. The premise for this situation is that the unused capacity in the current content container is already sufficient, i.e., the unused capacity is greater than or equal to n. This premise condition is similar to the one mentioned earlier regarding the automatic resizing strategy.
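Both cases can be observed through the Cap and Len methods. The following is a minimal sketch; the function name growTwice is hypothetical, and the exact capacity values are deliberately not printed because they may differ between Go versions.

```go
package main

import (
	"fmt"
	"strings"
)

// growTwice calls Grow twice and returns the capacity after each call,
// together with the length of the content.
func growTwice() (first, second, length int) {
	var b strings.Builder
	b.WriteString("abc")
	b.Grow(100) // unused capacity is too small, so a new container is created
	first = b.Cap()
	b.Grow(10) // at least 100 unused bytes already exist, so nothing happens
	second = b.Cap()
	return first, second, b.Len()
}

func main() {
	first, second, length := growTwice()
	fmt.Println(first-length >= 100) // true: room for 100 more bytes
	fmt.Println(first == second)     // true: the second call did nothing
	fmt.Println(length)              // 3: Grow never changes the content
}
```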

fmt.Println("Reset the builder ...")
builder1.Reset()
fmt.Printf("The third output(%d):\n%q\n", builder1.Len(), builder1.String())

Finally, the Builder value can be reused. By calling its Reset method, we can make the Builder value return to its zero value state, as if it had never been used before.

Once reset, the original content container in the Builder value is simply discarded. After that, the container and all its content can be reclaimed by the Go garbage collector, as long as nothing else (such as a string previously returned by the String method) still refers to it.
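A short sketch of the reset-and-reuse cycle, with a made-up function name resetDemo:

```go
package main

import (
	"fmt"
	"strings"
)

// resetDemo fills a Builder, resets it, and uses it again.
func resetDemo() (kept string, lenAfter, capAfter int, reused string) {
	var b strings.Builder
	b.WriteString("first round")
	kept = b.String() // still valid after the reset below
	b.Reset()         // back to the zero value; the old container is dropped
	lenAfter, capAfter = b.Len(), b.Cap()
	b.WriteString("second") // the value is usable again
	return kept, lenAfter, capAfter, b.String()
}

func main() {
	kept, l, c, reused := resetDemo()
	fmt.Printf("%q %d %d %q\n", kept, l, c, reused) // "first round" 0 0 "second"
}
```

The string obtained before the reset is unaffected, because the reset only detaches the Builder value from its old container and does not overwrite it.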

Knowledge Expansion #

Question 1: Are there any constraints on using the strings.Builder type? #

The answer is: Yes, there are constraints, summarized as follows:

  • Once it has been used, it cannot be copied anymore;
  • Since its content is not completely immutable, users have to deal with operation conflicts and concurrency safety themselves.

Once we call a concatenation method or the resizing method of a Builder value, we have started to use it. Obviously, these methods change the state of the content container of the Builder value.

After they have been called, we must no longer copy the Builder value in any way. The copy itself compiles and runs, but calling any of the above methods on the copy will cause a panic.

This panic tells us that this usage is not legal because the Builder value here is a copy, not the original value. By the way, the copying methods mentioned here include but are not limited to passing values between functions, passing values through channels, assigning values to variables, and so on.

var builder1 strings.Builder
builder1.Grow(1)
builder3 := builder1
//builder3.Grow(1) // This will cause a panic.
_ = builder3
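To see the panic without crashing the program, it can be caught with recover. The sketch below is one way to do that; the function name useCopy is an assumption of this example.

```go
package main

import (
	"fmt"
	"strings"
)

// useCopy copies a Builder that is already in use and calls Grow on the copy.
// It returns the recovered panic value, or nil if nothing panicked.
func useCopy() (recovered any) {
	defer func() { recovered = recover() }()
	var original strings.Builder
	original.Grow(1)   // the Builder is now in use
	copied := original // the copy itself is allowed
	copied.Grow(1)     // the copy check inside Grow panics here
	return nil
}

func main() {
	fmt.Println("recovered:", useCopy())
}
```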

Although this constraint is very strict, if we think about it carefully, we will find that it is beneficial.

Because the Builder values that have been used cannot be copied anymore, it is certain that multiple Builder values will not share the underlying byte array of the content container. This also avoids the conflicts that may occur when multiple Builder values of the same origin concatenate content.

However, although the Builder value that has been used cannot be copied, its pointer value can be. At any time, we can copy such a pointer value through any means. Note that such pointer values will always point to the same Builder value.

f2 := func(bp *strings.Builder) {
	bp.Grow(1) // This will not cause a panic, but it is not concurrency-safe.
	builder4 := *bp
	//builder4.Grow(1) // This will cause a panic.
	_ = builder4
}
f2(&builder1)

Because of this, a problem arises: if the Builder value is operated on by multiple parties at the same time, the content in it is likely to become chaotic. This is what we call operation conflicts and concurrency safety issues.

The Builder value itself cannot solve these problems. Therefore, when sharing the Builder value by passing its pointer value, we must ensure that all parties use it correctly, in order, and in a concurrency-safe manner. The most thorough solution is never to share the Builder value and its pointer value.
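If a Builder value really has to be shared, one common option is to guard it with a mutex. The sketch below shows the idea; the safeBuilder type and the writeConcurrently function are hypothetical and are not part of the standard library. Note that a mutex only prevents data races; it does not impose any particular order on the writes.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// safeBuilder is a hypothetical wrapper that serializes access with a mutex.
type safeBuilder struct {
	mu sync.Mutex
	b  strings.Builder
}

// WriteString appends s while holding the lock.
func (sb *safeBuilder) WriteString(s string) {
	sb.mu.Lock()
	defer sb.mu.Unlock()
	sb.b.WriteString(s)
}

// String returns the accumulated content while holding the lock.
func (sb *safeBuilder) String() string {
	sb.mu.Lock()
	defer sb.mu.Unlock()
	return sb.b.String()
}

// writeConcurrently starts n goroutines that each append one word.
func writeConcurrently(n int) string {
	var sb safeBuilder
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sb.WriteString("go ")
		}()
	}
	wg.Wait()
	return sb.String()
}

func main() {
	fmt.Println(len(writeConcurrently(100))) // 300
}
```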

We can declare a separate Builder value wherever one is needed, or we can declare a Builder value first and pass copies of it to various places before it is actually used. We can also use it first and pass it on afterwards, as long as we call its Reset method before passing it.

builder1.Reset()
builder5 := builder1
builder5.Grow(1) // This will not cause a panic.

In summary, the constraint on copying Builder values is meaningful and necessary. Although we can still share a Builder value in certain ways, it is better not to take the risk. Giving each party its own Builder value is the best solution. That said, copying a Builder value that is in its zero-value state causes no problems.

Question 2: Why is the strings.Reader type said to efficiently read strings? #

In contrast to the strings.Builder type, the strings.Reader type exists for reading strings efficiently. Its efficiency comes mainly from its reading mechanism, which encapsulates many best practices for reading content from string values.

Values of the strings.Reader type (hereinafter referred to as Reader values) allow us to conveniently read the content of a string. In the process of reading, the Reader value keeps track of the number of bytes already read (hereinafter referred to as the read count).

The read count also represents the starting index position for the next read. The Reader value relies on this count and string slicing expressions to achieve fast reading.
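The idea can be sketched with a stripped-down reader. The miniReader type below is purely illustrative and only loosely modeled on this description; it is not the actual source code of strings.Reader.

```go
package main

import "fmt"

// miniReader is a hypothetical, simplified reader for illustration only.
type miniReader struct {
	s string
	i int // read count: bytes already read, and the index of the next read
}

// Read copies up to len(p) unread bytes into p and advances the read count.
func (r *miniReader) Read(p []byte) int {
	n := copy(p, r.s[r.i:]) // the slice expression skips what was already read
	r.i += n
	return n
}

// readInChunks reads s in chunks of at most size bytes.
func readInChunks(s string, size int) []string {
	r := &miniReader{s: s}
	buf := make([]byte, size)
	var parts []string
	for {
		n := r.Read(buf)
		if n == 0 {
			break
		}
		parts = append(parts, string(buf[:n]))
	}
	return parts
}

func main() {
	fmt.Printf("%q\n", readInChunks("hello world", 5)) // ["hello" " worl" "d"]
}
```

Because slicing a string does not copy its bytes, each read only costs the copy into the caller's buffer.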

In addition, this read count is also an important basis for read rollback and position setting. Although it is part of the internal structure of the Reader value, we can still calculate it with the Len and Size methods of the value. The code is as follows:

var reader1 strings.Reader
// Omitted some code.
readingIndex := reader1.Size() - int64(reader1.Len()) // Calculated read count.

Most of the methods for reading that the Reader value has will promptly update the read count. For example, the ReadByte method will increase the value of this count by 1 after a successful read.

For another example, the ReadRune method will use the number of bytes occupied by the read character as the increment of the count after a successful read.

However, the ReadAt method is an exception. It does not read based on the read count, nor does it update it after reading. Because of this, this method can freely read any content in the Reader value it belongs to.

In addition, the Seek method of the Reader value will also update the read count of the value. In fact, the main function of this Seek method is to set the starting index position for the next read.

Moreover, if we pass the value of the constant io.SeekCurrent as the second argument value to this method, it will calculate the new count value based on the current read count and the value of the first argument offset.

Since the Seek method returns the new count value, we can easily verify this. For example, as shown below:

offset2 := int64(17)
expectedIndex := reader1.Size() - int64(reader1.Len()) + offset2
fmt.Printf("Seek with offset %d and whence %d ...\n", offset2, io.SeekCurrent)
readingIndex, _ := reader1.Seek(offset2, io.SeekCurrent)
fmt.Printf("The reading index in reader: %d (returned by Seek)\n", readingIndex)
fmt.Printf("The reading index in reader: %d (computed by me)\n", expectedIndex)
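A self-contained variant of the same check is sketched below, with a made-up input string and a hypothetical function name seekDemo, so that it can be run directly.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// seekDemo reads 4 bytes, seeks 6 bytes forward from the current position,
// and returns the expected count, the count reported by Seek, and the next byte.
func seekDemo() (expected, got int64, next byte) {
	r := strings.NewReader("0123456789abcdef")
	buf := make([]byte, 4)
	r.Read(buf) // the read count is now 4

	offset := int64(6)
	expected = r.Size() - int64(r.Len()) + offset
	got, _ = r.Seek(offset, io.SeekCurrent)
	next, _ = r.ReadByte() // the next read starts at the new count
	return expected, got, next
}

func main() {
	expected, got, next := seekDemo()
	fmt.Println(expected, got, string(next)) // 10 10 a
}
```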

In conclusion, the key to efficient reading with Reader values lies in the internal read count. The value of the count represents the starting index position for the next read, and it can be easily calculated. The Seek method of a Reader value can set this read count directly.

Summary #

Today, we mainly discussed two important types in the strings package, namely: Builder and Reader. The former is used to build strings, while the latter is used to read strings.

Compared to string values, the advantage of Builder values is mainly reflected in string concatenation. They can append more content while keeping the existing content unchanged, and they minimize the number of memory allocations and content copies during concatenation.

However, such values also come with usage constraints. Once a Builder value has been used, it can no longer be copied; calling its methods on a copy causes a panic. Although this constraint is strict, it has its benefits: it effectively prevents some operation conflicts. We can get around this constraint by some means (such as passing a pointer to the value), but the drawbacks outweigh the benefits. The best solution is to declare and use Builder values separately so that they do not interfere with each other.

Reader values allow us to easily read the content of a string. Their efficiency comes mainly from their mechanism for reading strings. During reading, a Reader value keeps track of the number of bytes already read, also known as the read count.

This count represents the starting index position for the next read, and it is also the key to efficient reading. With the Len and Size methods of such a value, we can calculate the read count. With it, we can read strings more flexibly.

I only introduced these two data types in this article, but it does not mean that the strings package only has these two useful entities. In fact, the strings package also provides a lot of functions. For example:

Count, IndexRune, Map, Replace, SplitN, Trim, etc.

They are all easy to use and efficient. Take a look at their source code; you may gain some insights from it.

Thinking question #

Today’s thinking question is: which interfaces do strings.Builder and strings.Reader each implement? What are the benefits of doing so?
