39 Bytes Package and Byte Slice Operations Part 2

Hello, I’m Haolin. Today, we will continue sharing about the bytes package and byte string operations.

In the previous article, we looked at the general role of the read count in bytes.Buffer and analyzed it in detail. Now, let’s build on that.

Knowledge Expansion #

Question 1: What is the resizing strategy of `bytes.Buffer`? #

A Buffer value can be resized manually (by calling its Grow method) or automatically, and the resizing strategy is essentially the same in both cases. Therefore, it is recommended to let the Buffer value resize automatically unless we know for certain how many bytes the subsequent content will need.

During resizing, the corresponding code in the Buffer value (referred to as resizing code hereafter) first checks whether the remaining capacity of the content container can meet the caller’s requirements or accommodate the new content.

If it can, then the resizing code will increase the length of the current content container.

Specifically, if the difference between the capacity of the content container and its length is greater than or equal to the required number of bytes, the resizing code will extend the length of the original content container through slicing, like this:

b.buf = b.buf[:length+need]
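
As a minimal sketch of this in-place case, the program below writes into a buffer whose container still has spare room and checks that the capacity does not change. The helper name writeWithinCapacity and the 16-byte starting capacity are assumptions made for illustration only.

```go
package main

import (
	"bytes"
	"fmt"
)

// writeWithinCapacity is a hypothetical helper: it writes into a buffer whose
// container still has enough spare room and reports the capacity before and
// after the second write.
func writeWithinCapacity() (before, after int, content string) {
	// Start from an empty content container with room for 16 bytes.
	buf := bytes.NewBuffer(make([]byte, 0, 16))
	buf.WriteString("abcd") // 4 bytes used, 12 spare
	before = buf.Cap()
	buf.WriteString("efgh") // needs 4 bytes, which fit in the spare room
	after = buf.Cap()
	return before, after, buf.String()
}

func main() {
	before, after, content := writeWithinCapacity()
	fmt.Println(before, after, content) // 16 16 abcdefgh
}
```

Because the second write fits behind the existing content, only the length of the container changes; no new memory is needed.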

If the remaining capacity of the content container is insufficient, the resizing code may replace the original content container with a new one to achieve resizing.

However, there is an optimization step here.

If half of the current capacity of the content container is still greater than or equal to the sum of the number of unread bytes (the value returned by Len) and the required number of bytes:

cap(b.buf)/2 >= b.Len() + need

Then the resizing code will reuse the current content container and copy the unread content in the container to the beginning.

This means that all of the read content will be overwritten by the unread content and the new content that follows.

This reuse is expected to save at least one subsequent memory allocation caused by resizing, as well as several bytes of copying.
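
The reuse case can be observed from the outside with a sketch like the following. The sizes (a 64-byte container, 38 bytes read, 2 unread, 30 new) are chosen by me so that the condition above holds, and the helper name reuseContainer is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// reuseContainer is a hypothetical helper that sets up a buffer in which the
// unread bytes plus the new bytes fit into half of the existing capacity.
func reuseContainer() (before, after int, content string) {
	buf := bytes.NewBuffer(make([]byte, 0, 64))
	buf.Write(bytes.Repeat([]byte{'r'}, 38)) // 38 bytes that will be read
	buf.WriteString("ab")                    // 2 bytes that stay unread
	buf.Next(38)                             // read count is now 38, Len() is 2
	before = buf.Cap()

	// 30 more bytes do not fit behind the 40 bytes already stored (24 spare),
	// but 2 unread + 30 new = 32 <= 64/2, so the container can be reused.
	buf.Write(bytes.Repeat([]byte{'n'}, 30))
	after = buf.Cap()
	return before, after, buf.String()
}

func main() {
	before, after, content := reuseContainer()
	fmt.Println(before, after, len(content), content[:2])
}
```

The capacity is the same before and after the write, which suggests that the unread bytes were moved to the front of the existing container rather than copied into a new one.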

If this optimization cannot be applied, that is, if the current capacity of the content container is less than twice the new length (the number of unread bytes plus the required number of bytes), the resizing code needs to create a new content container, copy the unread content from the original container into it, and finally replace the original container with the new one. In the implementation described here, the capacity of this new container is equal to twice the original capacity plus the required number of bytes; the exact figure may differ between Go releases.

New container capacity = 2 * original capacity + required number of bytes

With these steps, the extension of the content container is essentially complete. However, in order to maintain internal data consistency and avoid data confusion that may be caused by previously read content, the resizing code will set the read count to 0 and perform a slicing operation on the content container to hide the previously read content.
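
To summarize the three cases, here is a simplified model of the decision. It is my own sketch of the steps as described in this article, not the standard library's code; the function growPlan and its parameters are hypothetical, and the allocation size follows the formula given above.

```go
package main

import "fmt"

// growPlan is a hypothetical, simplified model of the resizing decision.
// capacity and length describe the content container, off is the read count,
// and need is the number of bytes about to be added.
func growPlan(capacity, length, off, need int) (action string, newCap int) {
	unread := length - off
	switch {
	case capacity-length >= need:
		// Enough spare room behind the existing content: extend by slicing.
		return "reslice", capacity
	case capacity/2 >= unread+need:
		// Move the unread bytes to the front and reuse the container.
		return "reuse", capacity
	default:
		// Allocate a new container, sized as this article describes.
		return "allocate", 2*capacity + need
	}
}

func main() {
	fmt.Println(growPlan(16, 4, 0, 4))    // reslice 16
	fmt.Println(growPlan(64, 40, 38, 30)) // reuse 64
	fmt.Println(growPlan(8, 8, 0, 100))   // allocate 116
}
```

In the second and third cases the read count would also be reset to 0 afterwards, since the unread content then starts at the beginning of the container.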

By the way, for a Buffer value in a zero value state, if the required number of bytes for the first resizing is not greater than 64, the value will create a content container based on a pre-defined byte array with a length of 64.

In this case, the capacity of the content container will be 64. The purpose of doing this is to quickly prepare the Buffer value for use when it is first used.
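
A small sketch to check this behavior on your own Go installation: it writes two bytes into a zero-value Buffer and reports the capacity before and after. The helper name firstWriteCapacity is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// firstWriteCapacity is a hypothetical helper that writes a few bytes into a
// zero-value Buffer and reports the capacity before and after the first write.
func firstWriteCapacity() (before, after int) {
	var buf bytes.Buffer // zero value: no content container yet
	before = buf.Cap()
	buf.WriteString("hi") // first expansion, far fewer than 64 bytes needed
	after = buf.Cap()
	return before, after
}

func main() {
	before, after := firstWriteCapacity()
	// Per the description above, these are expected to be 0 and 64.
	fmt.Println(before, after)
}
```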

Question 2: Which methods in `bytes.Buffer` can potentially cause content leaks? #

First of all, let’s clarify what content leaks mean. Content leaks, in this context, refer to a situation where the user of the Buffer value obtains content that they are not supposed to get through some non-standard (or informal) means.

For example, suppose I call a read method on a Buffer value and get back part of its unread content. Through that result, I should be able to obtain only the content that was unread at the moment of the call. If, after new content is written to the Buffer value, I can reach that new content directly through the old result, without calling any method again, then content has leaked.

This is a typical non-standard reading method. This reading method should not exist, and even if it does, we should not use it. Because it is inadvertently (or carelessly) exposed, its behavior is likely to be unstable.

In bytes.Buffer, both the Bytes method and the Next method may cause content leakage. The reason is that they both directly return a slice based on the content container to the caller of the method.

We all know that through a slice, we can directly access and manipulate its underlying array. This holds whether the slice was derived from an array or obtained by slicing another slice.

Here, the byte slices returned by the Bytes method and the Next method are obtained by slicing the content container. In other words, they share the same underlying array with the content container, at least for a period of time.

Take the Bytes method as an example. It will return all the unread content in the value it belongs to at the moment of calling. The sample code is as follows:

contents := "ab"
buffer1 := bytes.NewBufferString(contents)
fmt.Printf("The capacity of new buffer with contents %q: %d\n",
	contents, buffer1.Cap()) // The capacity of the content container is: 8.
unreadBytes := buffer1.Bytes()
fmt.Printf("The unread bytes of the buffer: %v\n", unreadBytes) // The unread content is: [97 98].

I initialized a Buffer value with a string value "ab", represented by the variable buffer1, and printed out some status of that value at that time.

You may wonder why the capacity is 8 when the string I put into this Buffer value has a length of only 2.

Although this is unrelated to our current topic, here is a hint: read the function stringtoslicebyte in the runtime package, and you will find the answer there.

Moving on to buffer1. I wrote the string value "cdefg" to it, and its capacity is still 8. The result value unreadBytes obtained by calling the Bytes method of buffer1 earlier contains all the unread content at that time.

However, since this result value still shares the same underlying array with the content container of buffer1 at this time, I can use this result value to obtain all the unread content of buffer1 at this time through a simple slicing operation. In this way, the new content of buffer1 is leaked.

buffer1.WriteString("cdefg")
fmt.Printf("The capacity of buffer: %d\n", buffer1.Cap()) // The capacity of the content container is still 8.
unreadBytes = unreadBytes[:cap(unreadBytes)]
fmt.Printf("The unread bytes of the buffer: %v\n", unreadBytes) // Based on the previously obtained result value, the unread content is: [97 98 99 100 101 102 103 0].

If I passed the value of unreadBytes to the outside world at that time, the outside world could manipulate the content of buffer1 through this value, like this:

unreadBytes[len(unreadBytes)-2] = byte('X') // The ASCII code for 'X' is 88.
fmt.Printf("The unread bytes of the buffer: %v\n", buffer1.Bytes()) // The unread content becomes: [97 98 99 100 101 102 88].

Now, you should be able to appreciate the serious consequences that can be caused by content leakage, right? The Next method of the Buffer value also has the same issue.
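
For comparison, here is a sketch of the same problem with the Next method. The 16-byte starting capacity and the helper name leakViaNext are assumptions of mine, chosen so that the second write fits in the existing container.

```go
package main

import (
	"bytes"
	"fmt"
)

// leakViaNext is a hypothetical helper showing that a slice returned by Next
// can be re-sliced to reach bytes that were never handed out.
func leakViaNext() (leaked string, afterTamper string) {
	buf := bytes.NewBuffer(make([]byte, 0, 16))
	buf.WriteString("abcd")
	head := buf.Next(2)   // intended result: only "ab"
	buf.WriteString("ef") // fits in the spare capacity, so the array is shared

	// Extending the old result exposes unread and newly written bytes.
	wider := head[:6]
	leaked = string(wider)

	// Writing through the old result changes what the buffer returns later.
	wider[2] = 'X'
	afterTamper = buf.String()
	return leaked, afterTamper
}

func main() {
	leaked, afterTamper := leakViaNext()
	fmt.Println(leaked, afterTamper) // abcdef Xdef
}
```

The caller asked for two bytes, yet the returned slice gives access to everything stored in the same underlying array, including content written afterwards.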

However, if a later expansion makes the Buffer value replace its content container (and therefore the underlying array), previously returned slices no longer share memory with it, and the leakage described above cannot go any further. I have written a more complete example in the file demo80.go; take a look at it and think it through.
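
The sketch below is a separate, minimal illustration of that point (it is not the content of demo80.go). It assumes an 8-byte starting container and a 1024-byte write, which is large enough to force a new container; the helper name detachAfterGrowth is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// detachAfterGrowth is a hypothetical helper showing that a slice obtained
// before a reallocating expansion no longer aliases the buffer afterwards.
func detachAfterGrowth() (oldView string, bufferHead string) {
	buf := bytes.NewBuffer(make([]byte, 0, 8))
	buf.WriteString("ab")
	old := buf.Bytes() // shares the 8-byte array with the buffer

	// 1024 bytes can neither fit in nor be rearranged within 8 bytes, so the
	// buffer has to move its content into a new, larger container.
	buf.Write(bytes.Repeat([]byte{'z'}, 1024))

	old[0] = 'X' // only the abandoned array is modified
	return string(old), string(buf.Bytes()[:2])
}

func main() {
	oldView, bufferHead := detachAfterGrowth()
	fmt.Println(oldView, bufferHead) // Xb ab
}
```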

Summary #

Let’s summarize based on the two articles we have discussed. Unlike the strings.Builder type, the bytes.Buffer not only allows us to concatenate and truncate byte sequences but also enables us to export and sequentially read sub-sequences.

The bytes.Buffer type uses a byte slice as its content container and keeps a field that tracks the number of bytes read in real-time.

Although we cannot access this count directly, it plays a critical role in the functionality of the Buffer value, so it is necessary for us to understand it.

Whether it is reading, writing, truncating, exporting, or resetting, the count of bytes read is an important part of the implementation.

Similar to the strings.Builder type, a Buffer value can be manually or automatically expanded. Unless we know the exact number of bytes required for the subsequent content, it is recommended to let the Buffer value automatically expand.

Expanding a Buffer value does not always mean allocating a larger content container to replace the existing one. Instead, following the principle of minimizing memory allocation and content copying, the Buffer value reuses the current content container whenever it can. Only when the capacity is truly insufficient does it create a new content container.

In addition, you may not have considered that certain methods of a Buffer value can lead to content leakage. This is mainly because the returned values of these methods will share the underlying array with the Buffer value for a certain period of time.

If we intentionally or unintentionally pass these result values to the outside, it is possible for the outside world to manipulate the content associated with the Buffer value.

This is a serious data security issue, and we must prevent it from happening. The most thorough approach is to isolate these values before passing them out. For example, make a deep copy of them and send out only the copies.
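
One possible way to make such a copy is sketched below. The helper names safeSnapshot and copyBeforeSharing are hypothetical; the copy itself uses only the built-in append function.

```go
package main

import (
	"bytes"
	"fmt"
)

// safeSnapshot is a hypothetical helper that returns a copy of the unread
// content, stored in a backing array that the buffer does not share.
func safeSnapshot(buf *bytes.Buffer) []byte {
	return append([]byte(nil), buf.Bytes()...)
}

// copyBeforeSharing shows that changes made through the copy do not reach
// the buffer, and that later writes to the buffer do not reach the copy.
func copyBeforeSharing() (snapshot string, content string) {
	buf := bytes.NewBufferString("ab")
	snap := safeSnapshot(buf)
	buf.WriteString("cd") // not visible through snap
	snap[0] = 'X'         // not visible through buf
	return string(snap), buf.String()
}

func main() {
	snapshot, content := copyBeforeSharing()
	fmt.Println(snapshot, content) // Xb abcd
}
```

The copy costs one extra allocation per call, which is the price of keeping the buffer's content isolated from outside code.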

Thought Exercise #

Today’s thought exercise is: Compare the String method of strings.Builder and bytes.Buffer, and determine which one is more efficient. What is the reason?
