17 Slice Header How to Efficiently Handle Data With Slice

17 SliceHeader - How to Efficiently Handle Data with slice #

In [Lecture 4 | Collection Types: How to Use Arrays, Slices, and Maps Correctly?], you have learned about slices and how to use them. In this lesson, I will elaborate on the principles of slices and take you through their underlying design.

Arrays #

Before delving into the principles of slices, let’s first introduce arrays. Arrays are present in almost all programming languages, including Go. So why did Go choose to design slices in addition to arrays? To answer this question, let’s first understand the limitations of arrays.

In the following examples, a1 and a2 are two predefined arrays, but their types are different. The type of variable a1 is [1]string, while the type of variable a2 is [2]string. In other words, the size of an array is part of its type. Only when the type and size of the elements inside the array are the same, the two arrays are of the same type.

a1 := [1]string{"飞雪无情"}

a2 := [2]string{"飞雪无情"}

We can summarize that an array consists of two parts: the size of the array and the type of its elements.

// Pseudo code representation of an array structure

array {

  length

  item type

}

For example, the size of array a1 is 1, and the type of its elements is string. Therefore, a1 can only store a maximum of 1 element of type string. Similarly, the size of array a2 is 2, and the type of its elements is also string. Hence, a2 can store a maximum of 2 elements of type string. Once an array is declared, its size and the type of its elements cannot be changed. You cannot arbitrarily add multiple elements to an array. This is the first limitation of arrays.

Since the size of an array is fixed, if you need to store a large amount of data in an array, you need to specify a suitable size in advance, such as 100,000. Here is an example:

a10 := [100000]string{"飞雪无情"}

Although this approach can solve the problem, it introduces another issue: memory consumption. In Go, when passing arrays as arguments between functions, parameter passing is done by value. Thus, the same content will be repeatedly copied when the array is passed between functions. This leads to significant memory waste, which is the second limitation of arrays.

Despite the limitations, arrays are still an essential underlying data structure in Go. For instance, the underlying data of a slice is stored in an array.

Slices #

You already know that arrays, although useful, have certain restrictions in terms of operations. To overcome these limitations, Go created slices. A slice is an abstraction and encapsulation of an array. Its underlying data is an array that stores all the elements. However, a slice can dynamically add elements and automatically expand its capacity when needed. You can think of a slice as a dynamic array. In Go, except for cases where a type with a specified length is required, most situations use slices instead of arrays.

Dynamic Expansion #

Using the built-in append function, you can append any number of elements to a slice, thereby solving the first limitation of arrays.

In the following example, I use the append function to add two strings to the slice ss and assign the new slice to ss.

func main() {

   ss := []string{"飞雪无情", "张三"}

   ss = append(ss, "李四", "王五")

   fmt.Println(ss)

}

When you run this code, you will see the following output:

[飞雪无情 张三 李四 王五]

When appending elements using append, if the capacity of the slice is insufficient, the append function will automatically expand the capacity. For example, in the above example, I print the length and capacity of the slice before and after using append, as shown in the following code:

func main() {

   ss := []string{"飞雪无情", "张三"}

   fmt.Println("Length of slice ss is", len(ss), ", capacity is", cap(ss))

   ss = append(ss, "李四", "王五")

   fmt.Println("Length of slice ss is", len(ss), ", capacity is", cap(ss))

   fmt.Println(ss)

}

In this code, I use the built-in len function to get the length of the slice and the cap function to get its capacity. When you run this code, you will see the following output:

The length of slice `ss` is 2 and the capacity is 2.

The length of slice `ss` is 4 and the capacity is 4.

[飞雪无情 张三 李四 王五]

Before calling append, the capacity is 2. After calling append, the capacity becomes 4, indicating automatic expansion.

Tip: The principle of automatic expansion in the append function is to create a new underlying array, copy the elements from the original slice to the new array, and then return a slice pointing to the new array.

Data Structure #

In Go language, a slice is actually a struct, and its definition is as follows:

type SliceHeader struct {
    Data uintptr
    Len  int
    Cap  int
}

SliceHeader is the runtime representation of a slice, and it has three fields: Data, Len, and Cap.

Data points to the underlying array storing the slice elements.
Len represents the length of the slice.
Cap represents the capacity of the slice.

With these three fields, an array can be abstracted into a slice for better operations. Therefore, different slices may have the same underlying array. Now let’s prove this through an example. The code is as follows:

func main() {
    a1 := [2]string{"飞雪无情","张三"}
    s1 := a1[0:1]
    s2 := a1[:]
    
    // Print the `Data` values of s1 and s2, which are the same
    
    fmt.Println((*reflect.SliceHeader)(unsafe.Pointer(&s1)).Data)
    fmt.Println((*reflect.SliceHeader)(unsafe.Pointer(&s2)).Data)
}

By using unsafe.Pointer and converting them to *reflect.SliceHeader pointers learned in the previous lesson, we can print the value of Data, and the result is as follows:

824634150744
824634150744

You will find that they are the same, which means that these two slices share the same underlying array. Therefore, when assigning slices or performing slice operations, the same array is used, and the elements are not copied. This reduces memory usage and improves efficiency.

Note: Although multiple slices share the same underlying array, if one slice modifies the elements inside, other slices will also be affected. So, be careful when passing a slice as a parameter between functions, and try not to modify the elements of the original slice.

The essence of a slice is SliceHeader, and because function parameters are passed by value, a copy of SliceHeader is passed, not a copy of the underlying array. At this point, the advantages of slices are reflected, because the memory usage of the copy of SliceHeader is very small. Even for a very large slice (the underlying array has many elements), it occupies at most 24 bytes of memory. This solves the problem of memory waste when passing large arrays as arguments.

Tip: The types of the three fields of SliceHeader are uintptr, int, and int. On a 64-bit machine, the maximum size of these three fields is int64, and one int64 occupies 8 bytes. Therefore, three int64 occupies 24 bytes of memory.

To get the values of the three fields of the slice data structure, you can also use a completely customized struct instead of SliceHeader. Just make sure the fields are the same.

For example, in the following example, by converting it to a custom *slice pointer using unsafe.Pointer, you can also get the values of the three fields. You can even rename the fields to d, l, and c to achieve the same purpose.

sh1 := (*slice)(unsafe.Pointer(&s1))
fmt.Println(sh1.Data, sh1.Len, sh1.Cap)

type slice struct {
    Data uintptr
    Len  int
    Cap  int
}

Tip: We should still use SliceHeader as much as possible, because it is the standard provided by the Go language, which can maintain consistency and facilitate understanding.

Reasons for Efficiency #

From the perspective of collection types, arrays, slices, and maps are all collection types because they can all store elements. However, accessing and assigning values in arrays and slices is more efficient because they involve continuous memory operations and can quickly find the address where the element is stored using an index.

Tip: Of course, maps are also very valuable because their keys can be various types, such as int, int64, string, etc. However, the indexes of arrays and slices can only be integers.

Furthermore, when comparing arrays and slices, slices are more efficient because when assigning or passing as function parameters, they do not need to copy all the elements. Instead, only the three fields of the SliceHeader are copied, and they still share the same underlying array.

In the following example, I defined two functions arrayF and sliceF which print the array pointer and the underlying array pointer of the slice passed into them.

func main() {

   a1 := [2]string{"Flying Snow", "Zhang San"}

   fmt.Printf("Main function array pointer: %p\n", &a1)

   arrayF(a1)

   s1 := a1[0:1]

   fmt.Println((*reflect.SliceHeader)(unsafe.Pointer(&s1)).Data)

   sliceF(s1)

}

func arrayF(a [2]string) {

   fmt.Printf("arrayF function array pointer: %p\n", &a)

}

func sliceF(s []string) {

   fmt.Printf("sliceF function Data: %d\n", (*reflect.SliceHeader)(unsafe.Pointer(&s)).Data)

}

Then I call them in the main function, and when running the program, the following result will be printed:

Main function array pointer: 0xc0000a6020

arrayF function array pointer: 0xc0000a6040

824634400800

sliceF function Data: 824634400800

You will notice that the pointers of the same array in the main function and the arrayF function are different, which means that the array is copied when passed as an argument, creating a new array. On the other hand, the Data field of the slice is the same, which means that both slices in the main function and the sliceF function still share the same underlying array, and the underlying array is not copied.

Tip: The efficiency of slices is also reflected in the for range loop, because the temporary variable obtained during the loop is also a value copy. Therefore, when iterating through a large array, the efficiency of slices is higher.

The efficient encapsulation of slices based on pointers is the fundamental reason for their high efficiency, as it reduces memory usage and the time spent on memory copying.

Conversion between string and []byte #

Next, I will further explain the reason for the efficiency of slices by using examples of forced conversion between string and []byte.

For example, I convert a []byte to a string and then convert it back. The example code is as follows:

s := "Flying Snow"

b := []byte(s)

s3 := string(b)

fmt.Println(s, string(b), s3)

In this example, variable s is a string. It can be forcefully converted to a variable b of type []byte using []byte(s), and then forcefully converted to a variable s3 of type string using string(b). When printing the values of all three variables, they are all “Flying Snow”.

In Go language, the forced conversion between string and []byte is implemented by allocating new memory and copying the content. Now, I will verify that the forced conversion is done by reallocating memory using the memory addresses of the real contents pointed to by string and []byte. The following code shows it:

s := "Flying Snow"

fmt.Printf("Memory address of s: %d\n", (*reflect.StringHeader)(unsafe.Pointer(&s)).Data)

b := []byte(s)

fmt.Printf("Memory address of b: %d\n", (*reflect.SliceHeader)(unsafe.Pointer(&b)).Data)

s3 := string(b)

fmt.Printf("Memory address of s3: %d\n", (*reflect.StringHeader)(unsafe.Pointer(&s3)).Data)

If you run them, you will find that the printed memory addresses are not the same. This indicates that although the content is the same, they are not the same string because the memory addresses are different.

Tip: You can view the source code of the runtime.stringtoslicebyte and runtime.slicebytetostring functions to understand the specific implementation of string to []byte and []byte to string conversion.

From the above example code, you already know what a SliceHeader is. In fact, StringHeader is the same as SliceHeader, representing the real structure of strings and slices during program execution. The definition of StringHeader is as follows:

type StringHeader struct {
   Data uintptr
   Len  int
}

In other words, during program execution, strings and slices are essentially StringHeader and SliceHeader. Both of these structs have a Data field, which stores a pointer to the actual content. So by printing the value of the Data field, we can determine whether the conversion of []byte to string results in a reallocation of memory.

Now you know that this type conversion, []byte(s) and string(b), will make a copy of the string. If the string is very large, this method cannot meet the requirements of high-performance programs due to the large memory overhead. Therefore, optimization is needed.

How to optimize? Since memory allocation is the cause of high memory overhead, the optimization approach should be to achieve type conversion without reallocating memory.

By carefully observing the StringHeader and SliceHeader structs, you will find that their first two fields are exactly the same. So the conversion from []byte to string is equivalent to using unsafe.Pointer to convert *SliceHeader to *StringHeader, which is *[]byte to *string. The principle is similar to converting a slice to a custom slice struct as I mentioned earlier.

In the following example, s4 and s3 have the same content. The difference is that s4 does not allocate new memory (zero copy), it uses the same memory as the b variable because their underlying Data field values are the same. This saves memory and achieves the purpose of converting []byte to string.

s := "飞雪无情"

b := []byte(s)

//s3 := string(b)

s4 := *(*string)(unsafe.Pointer(&b))

SliceHeader has three fields: Data, Len, and Cap. StringHeader has two fields: Data and Len. So when converting from *SliceHeader to *StringHeader using unsafe.Pointer, there is no problem because *SliceHeader can provide the Data and Len fields required by *StringHeader. But the reverse is not possible because *StringHeader lacks the Cap field required by *SliceHeader, so we need to add a default value ourselves.

In the following example, b1 and b have the same content. The difference is that b1 does not allocate new memory, it uses the same memory as the s variable because their underlying Data field values are the same. This also saves memory.

s := "飞雪无情"

//b := []byte(s)

sh := (*reflect.SliceHeader)(unsafe.Pointer(&s))

sh.Cap = sh.Len

b1 := *(*[]byte)(unsafe.Pointer(sh))

Note: After converting string to []byte using unsafe.Pointer, you cannot modify []byte, such as b1[0] = 12. This will cause an exception and crash the program. This is because strings are read-only in Go.

The use of unsafe.Pointer to perform type conversion and avoid memory copying to improve performance is also used in the Go standard library, for example, in the strings.Builder struct. In its String method, which converts the []byte type buf to string, unsafe.Pointer is used to improve efficiency:

// String returns the accumulated string.
func (b *Builder) String() string {
   return *(*string)(unsafe.Pointer(&b.buf))
}

The conversion between string and []byte is a good example of using the SliceHeader struct. It allows zero-copy type conversion, improves efficiency, and avoids memory waste.

Summary #

Through the analysis of slices, I believe you can deeply feel the charm of Go. It encapsulates pointers, arrays, and other low-level concepts, and provides the concept of slices to developers, which is both convenient and efficient, and improves program performance.

The design of slices in Go has a lot of inspiring significance. You can also use uintptr or fields of slice type to optimize performance, just like the Data uintptr field in the Go SliceHeader.

At the end of this lesson, here is a question for you to consider: What are some other examples where performance is improved using unsafe.Pointer or uintptr? Feel free to discuss in the comments.

In the next lesson, we will enter the module of project management, and first learn “Quality Assurance: How does Go ensure quality through testing?” Remember to tune in!