
49 Program Performance Analysis Basics - Part 2 #

Hello, I’m Hao Lin, and today we will continue sharing the basics of program performance analysis.

In the previous article, we discussed how to sample profile information for CPU usage. Today, let's explore some related questions together.

Knowledge Expansion #

Question 1: How to set the sampling frequency of memory profile? #

The memory profile records the heap memory usage of a Go program at runtime, sampled at a certain proportion. Setting its sampling frequency is simple: just assign a value to the runtime.MemProfileRate variable.

This variable specifies, on average, how many bytes must be allocated before the heap memory usage is sampled once. If it is set to 0, the Go runtime system stops sampling the memory profile entirely. Its default value is 512 KB.

Note that if you want to change the sampling frequency, it is best to do so as early as possible, and only once; otherwise it may interfere with the sampling work of the Go runtime system. For example, set it once at the beginning of the main function.

After that, when we want to obtain the memory profile, we need to call the WriteHeapProfile function in the runtime/pprof package. This function will write the collected memory profile to the specified writer.

Note that the memory profile obtained through the WriteHeapProfile function is not real-time; it is a snapshot taken at the completion of the most recent garbage collection. If you want up-to-date information, you can call the runtime.ReadMemStats function instead. Be aware, however, that this function causes a brief pause of the Go scheduler.

Above is a brief answer to the question of setting the sampling frequency of memory profile.

Question 2: How to obtain the blocking profile? #

We can set the sampling frequency of the blocking profile by calling the SetBlockProfileRate function in the runtime package. This function takes an int parameter named rate.

This parameter means: whenever a blocking event lasts longer than a certain number of nanoseconds, it can be sampled. If the value is less than or equal to 0, the Go runtime system stops sampling the blocking profile entirely.

In the runtime package, there is also a package-private variable named blockprofilerate, of type uint64. It means: whenever a blocking event spans at least a certain number of CPU clock cycles, it can be sampled. That is very similar to the meaning of the rate parameter, isn't it?

In fact, the only difference between the two is the unit of measurement. The SetBlockProfileRate function first converts the value of the rate parameter from nanoseconds into CPU clock cycles, performs the necessary type conversion, and then atomically assigns the result to the blockprofilerate variable. Since this variable's default value is 0, the Go runtime system does not record any blocking events by default.

On the other hand, when we want to obtain the blocking profile, we need to call the Lookup function in the runtime/pprof package with the argument "block", which returns a *runtime/pprof.Profile value (a Profile value for short). After that, we call the WriteTo method of this Profile value to have it write the profile information into the specified writer.

This WriteTo method has two parameters. The first is the writer just mentioned, of type io.Writer. The second, named debug and of type int, controls the level of detail of the profile information.

The debug parameter has two main optional values: 0 and 1. When debug is 0, the profile that the WriteTo method writes contains only the memory addresses needed by the go tool pprof tool, shown in hexadecimal form, and the whole profile is encoded as a byte stream via protocol buffers.

When debug is 1, the corresponding package names, function names, source file paths, and line numbers are added as comments, and the output becomes plain text that we can read directly. The value of debug can also be 2. In that case, the output is also plain text and usually includes even more details. Exactly which details depends on the name we pass when calling the runtime/pprof.Lookup function. Let's take a look at this function next.

Question 3: What is the correct way to call the runtime/pprof.Lookup function? #

The runtime/pprof.Lookup function (hereinafter referred to as the Lookup function) provides summary information corresponding to the given name. This summary information is represented by a Profile value. If this function returns nil, it means that there is no summary information corresponding to the given name.

The runtime/pprof package predefines six profile names, for which the methods that collect and output the profile information are already prepared, so we can use them directly. They are: goroutine, heap, allocs, threadcreate, block, and mutex.

When we pass "goroutine" to the Lookup function, it will use the corresponding method to collect the stack trace information of all goroutines currently in use. Note that this collection will cause a brief pause in the Go language scheduler.

When the WriteTo method of the Profile value returned by the function is called with a debug value greater than or equal to 2, the method outputs the stack traces of all goroutines. This can be a lot of information; if it takes up more than 64 MB, the method truncates the excess.

If the parameter value received by the Lookup function is "heap", it will collect sampling information related to heap memory allocation and release. This is actually the memory summary information we discussed earlier. The subsequent operations are very similar when we pass "allocs".

In these two cases, the Profile values returned by the Lookup function are also very similar. The only difference is that, when the WriteTo methods of these two Profile values are called with debug equal to 0, their output differs slightly.

"heap" will cause the output memory summary information to default to the perspective of “in use space”, while the default perspective for "allocs" is “allocated space”.

“In use space” refers to the allocated memory space that has not been released. In this perspective, the go tool pprof tool does not consider the part of information related to released space. In the perspective of “allocated space”, all memory allocation information will be displayed, regardless of whether these memory spaces have been released during sampling.

In addition, whether it is "heap" or "allocs", when we call the WriteTo method of the Profile value, as long as the value assigned to the debug parameter is greater than 0, the specifications of the output content will be the same.

The parameter value "threadcreate" causes the Lookup function to collect stack traces, each of which describes a chain of code calls that led to the creation of a new operating system thread. The output of such Profile values again has only two forms, depending on whether the value passed to their WriteTo method is greater than 0.

Back to "block" and "mutex". "block" represents the stack trace information of the code that is blocked due to contention of synchronization primitives. Do you remember? This is the blocking summary information we talked about earlier.

In contrast, "mutex" represents the stack trace information of the code that used to own the synchronization primitive. Their output specifications also only have two types, depending on whether debug is greater than 0.

The synchronization primitives mentioned here refer to a low-level synchronization tool or mechanism that exists in the Go language runtime system.

It operates directly on memory addresses and uses asynchronous semaphores and atomic operations as implementation methods. Channels, mutexes, condition variables, WaitGroup, and the Go language runtime system itself all use it to implement their own functions.

Alright, we’ve talked quite a bit about this issue. I believe you now have a deep understanding of the usage and meaning behind the Lookup function. The demo99.go file contains some sample code for your reference.

Question 4: How to add a performance profiling interface to a network service based on the HTTP protocol? #

This question is actually quite simple. In most cases, all we need to do is import the net/http/pprof package into our program, like this:

import _ "net/http/pprof"

Then, start the network service and begin listening, for example:

log.Println(http.ListenAndServe("localhost:8082", nil))

After running this program, we can access a minimalist webpage by visiting the address http://localhost:8082/debug/pprof in a web browser. If you carefully read the previous question, you should be able to quickly understand the meaning of each section on this webpage.

There are many available subpaths under the /debug/pprof/ URL path, which you can explore by clicking on the links in the webpage. These six subpaths – allocs, block, goroutine, heap, mutex, and threadcreate – are all handled by the Lookup function underneath. By now, you should already be familiar with this function.

All of these subpaths can accept a query parameter called debug. It controls the format and level of detail of the summary information. I won’t go into the optional values for this parameter again. Its default value is 0. Additionally, there is another query parameter called gc that controls whether to force a garbage collection before obtaining the summary information. If its value is greater than 0, the program will do so. However, this parameter is only effective under the /debug/pprof/heap path.

Once the /debug/pprof/profile path is accessed, the program will start sampling CPU profile information. It accepts a query parameter called seconds, which specifies how long the sampling should last. If this parameter is not explicitly specified, the sampling will last for 30 seconds. Note that under this path, the program only responds with protobuf-converted byte streams. We can directly read such an HTTP response using the go tool pprof tool, for example:

go tool pprof "http://localhost:8082/debug/pprof/profile?seconds=60"

In addition, there is another path worth mentioning: /debug/pprof/trace. Under this path, the program mainly handles our requests using the API provided by the runtime/trace package.

More specifically, the program first calls the trace.Start function, and then, after the duration specified by the seconds query parameter, it calls the trace.Stop function. The default value for seconds is 1 second. As for the functionality of the runtime/trace package, I’ll leave it to you to explore and discover.

The aforementioned URL paths are fixed and unchangeable. These are the default access rules. However, we can also customize them, like this:

import (
    "log"
    "net/http"
    "net/http/pprof"
    "strings"
)

mux := http.NewServeMux()
pathPrefix := "/d/pprof/"
mux.HandleFunc(pathPrefix,
    func(w http.ResponseWriter, r *http.Request) {
        name := strings.TrimPrefix(r.URL.Path, pathPrefix)
        if name != "" {
            pprof.Handler(name).ServeHTTP(w, r)
            return
        }
        pprof.Index(w, r)
    })
mux.HandleFunc(pathPrefix+"cmdline", pprof.Cmdline)
mux.HandleFunc(pathPrefix+"profile", pprof.Profile)
mux.HandleFunc(pathPrefix+"symbol", pprof.Symbol)
mux.HandleFunc(pathPrefix+"trace", pprof.Trace)

server := http.Server{
    Addr:    "localhost:8083",
    Handler: mux,
}
log.Println(server.ListenAndServe())

As you can see, with just a few entities from the net/http/pprof package, we were able to achieve this customization. This is especially useful when using third-party network service development frameworks.

The access rules included in our custom HTTP request multiplexer mux are similar to the default rules, except that the URL path prefix is a bit shorter.

The process of customizing mux is similar to what the init function in the net/http/pprof package does. In fact, the existence of this init function is the reason why we can access the related paths simply by importing the “net/http/pprof” package.

When we write network service programs, using the net/http/pprof package is much more convenient and practical than directly using the runtime/pprof package. Through proper utilization, this package can provide strong support for monitoring network services. This is as much as I’ll introduce about this package for now.

Summary #

In these two articles, we mainly talked about the performance analysis of Go programs, and many of the things mentioned are essential knowledge and skills for you. These will help you truly understand the series of operations represented by sampling, collection, and output.

I mentioned several issues related to summary information. What you need to remember is what each type of summary information represents and what kind of content it contains.

You also need to know the correct way to obtain them, including how to start and stop sampling, how to set the sampling frequency, and how to control the format and level of detail of the output.

In addition, the correct way to call the Lookup function in the runtime/pprof package is also important. For other summary information besides CPU summary, we can obtain them by calling this function.

In addition, I also mentioned an upper-level application, which is to add a performance analysis interface to network services based on the HTTP protocol. This is also a very practical part.

Although the net/http/pprof package provides not many program entities, it allows us to embed performance analysis interfaces in different ways. Some of these ways are extremely simple and ready-to-use, while others are used to meet various custom requirements.

That is the Go knowledge I wanted to share with you today; it forms the foundation of program performance analysis. If you run Go programs in a production environment, you will definitely come across these topics. I hope you will think through all the content and questions raised here carefully, so that you can handle them with ease when you actually need them.

Thought Questions #

The thought question I left for you today has actually been revealed before, which is: What is the function of the runtime/trace package?

Thank you for listening, and we’ll see you next time.
