
38 Performance Analysis How to Analyze Performance of Go Code #

Hello, I’m Kong Lingfei.

As developers, we often focus on unit testing for functionality and tend to overlook performance details. However, if we don’t have a comprehensive understanding of the overall performance of our project when it goes live, we may encounter various issues as the request volume increases, such as high CPU usage, high memory utilization, and high request latency. To avoid these performance bottlenecks, we need to use certain methods during the development process to analyze the performance of our programs.

Go language has built-in tools and methods for performance optimization and monitoring, which greatly improve the efficiency of our profile analysis. With these tools, we can easily analyze the performance of Go programs. In Go development, developers mainly use the built-in pprof package for performance analysis.

During performance analysis, we first use some tools and packages to generate performance data files, and then use the pprof tool to analyze these files, thus analyzing the performance of the code. Now, let’s take a look at how to perform these two steps separately.

Generating Performance Data Files #

To view performance data, you first need to generate a performance data file. There are three ways to do this: using the command line, using code, and using the net/http/pprof package. Each of these methods can produce both CPU and memory performance data.

Next, let’s take a look at how these three methods generate performance data files.

Generating Performance Data Files Using the Command Line #

We can use go test -cpuprofile to generate performance test data. Go to the internal/apiserver/service/v1 directory and execute the following command:

$ go test -bench=".*" -cpuprofile cpu.profile -memprofile mem.profile
goos: linux
goarch: amd64
pkg: github.com/marmotedu/iam/internal/apiserver/service/v1
cpu: AMD EPYC Processor
BenchmarkListUser-8   	     280	   4283077 ns/op
PASS
ok  	github.com/marmotedu/iam/internal/apiserver/service/v1	1.798s

The above command will generate three files in the current directory:

  • v1.test, the compiled binary file for testing, which can be used to resolve symbols during performance analysis.
  • cpu.profile, the CPU performance data file.
  • mem.profile, the memory performance data file.

Generating Performance Data Files Using Code #

We can also generate performance data files using code, for example, in the pprof.go file:

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// Create the CPU profile file and start CPU sampling.
	cpuOut, err := os.Create("cpu.profile")
	if err != nil {
		log.Fatal(err)
	}
	defer cpuOut.Close()

	if err := pprof.StartCPUProfile(cpuOut); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// Write a heap profile on exit. Defers run last-in-first-out, so the
	// heap snapshot is written before memOut is closed, and CPU profiling
	// stops before cpuOut is closed.
	memOut, err := os.Create("mem.profile")
	if err != nil {
		log.Fatal(err)
	}
	defer memOut.Close()
	defer pprof.WriteHeapProfile(memOut)

	Sum(3, 5)
}

func Sum(a, b int) int {
	return a + b
}

Execute the pprof.go file:

$ go run pprof.go

After running the pprof.go file, cpu.profile and mem.profile performance data files will be generated in the current directory.

Generating Performance Data Files Using net/http/pprof #

If you want to analyze the performance of an HTTP Server, you can use the net/http/pprof package to generate performance data files.

In the IAM project, Gin framework is used as the HTTP engine. Therefore, the IAM project uses the github.com/gin-contrib/pprof package to enable HTTP performance analysis. The github.com/gin-contrib/pprof package is a simple wrapper for net/http/pprof, which converts the pprof functionality into a Gin middleware, allowing the pprof middleware to be loaded as needed.

In the pprof.go file of the github.com/gin-contrib/pprof package, there is the following code:

func Register(r *gin.Engine, prefixOptions ...string) {
    prefix := getPrefix(prefixOptions...)

    prefixRouter := r.Group(prefix)
    {
        ...
        prefixRouter.GET("/profile", pprofHandler(pprof.Profile))
        ...
    }
}

func pprofHandler(h http.HandlerFunc) gin.HandlerFunc {
    handler := http.HandlerFunc(h)
    return func(c *gin.Context) {
        handler.ServeHTTP(c.Writer, c.Request)
    }
}

From the above code, you can see that the github.com/gin-contrib/pprof package wraps each net/http/pprof handler, such as pprof.Profile, in a gin.HandlerFunc, that is, a Gin middleware.

To enable HTTP performance analysis, you only need to register the pprof HTTP handler in the code (located in the internal/pkg/server/genericapiserver.go file):

// install pprof handler
if s.enableProfiling {
    pprof.Register(s.Engine)
}

The above code decides whether to enable HTTP performance analysis based on the --feature.profiling configuration. Once it is enabled and the iam-apiserver HTTP service is started, you can visit http://x.x.x.x:8080/debug/pprof (where x.x.x.x is the address of the Linux server) to view the available profiles, as shown in the following image:

Image

We can use the following command to get the CPU performance data file:

$ curl http://127.0.0.1:8080/debug/pprof/profile -o cpu.profile

After executing the above command, wait for 30 seconds: the /debug/pprof/profile endpoint samples CPU usage for 30 seconds by default (the window can be changed with the ?seconds= query parameter). During this time, keep sending requests to the server; the request rate can be chosen according to your scenario. When sampling finishes, the endpoint returns the CPU profile, which curl saves to the cpu.profile file in the current directory.

Similarly, we can execute the following command to generate the memory performance data file:

$ curl http://127.0.0.1:8080/debug/pprof/heap -o mem.profile

The above command will automatically download the heap file, which will be saved in the mem.profile file in the current directory by the curl command.

We can use the go tool pprof [mem|cpu].profile command to analyze the CPU and memory performance of the HTTP interface. We can also use the command go tool pprof http://127.0.0.1:8080/debug/pprof/profile or go tool pprof http://127.0.0.1:8080/debug/pprof/heap to directly enter the interactive shell of the pprof tool. go tool pprof will first download and save the CPU and memory performance data files, and then analyze these files.

Using the above three methods, we have generated cpu.profile and mem.profile, and now we can use go tool pprof to analyze these two performance data files and analyze the CPU and memory performance of our program. Next, I will explain the process of performance analysis in detail.

Performance Analysis #

To analyze performance using go tool pprof, you can refer to the following diagram:

Image

First, let me introduce the pprof tool, then explain how to generate performance data, and finally, I will introduce CPU and memory performance analysis methods.

Introduction to pprof Tool #

pprof is a Go program performance analysis tool that allows you to access and analyze performance data files. It also provides readable output information according to our requirements. Go integrates profile sampling tools at the language level, so you can simply import the runtime/pprof or net/http/pprof packages into your code to obtain the program’s profile files and perform performance analysis based on these files.

net/http/pprof is a wrapper around the runtime/pprof package and exposes it on an HTTP port.

Generating Performance Data #

When performing performance analysis, the main focus is on analyzing memory and CPU performance. To analyze the performance of memory and CPU, we need to generate performance data files first. In the IAM source code, there are also performance test cases available. Next, I will use the performance test cases in the IAM source code to demonstrate how to analyze the performance of a program.

Go to the internal/apiserver/service/v1 directory. The user_test.go file contains the performance test function BenchmarkListUser. Execute the following command to generate performance data files:

$ go test -benchtime=30s -benchmem -bench=".*" -cpuprofile cpu.profile -memprofile mem.profile
goos: linux
goarch: amd64
pkg: github.com/marmotedu/iam/internal/apiserver/service/v1
cpu: AMD EPYC Processor
BenchmarkListUser-8   	     175	 204523677 ns/op	   15331 B/op	     268 allocs/op
PASS
ok  	github.com/marmotedu/iam/internal/apiserver/service/v1	56.514s

The above command will generate the cpu.profile, mem.profile performance data files, and the v1.test binary file in the current directory. Next, we will use these files to analyze the CPU and memory performance of the code. To obtain enough sampling data, we set the benchmark time to 30s.

When performing performance analysis, different methods can be used, such as analyzing sampling graphs, analyzing flame graphs, or using the interactive mode of go tool pprof to view CPU and memory consumption data of functions. I will use these methods to analyze CPU and memory performance.

CPU Performance Analysis #

By default, Go’s runtime samples CPU usage at a frequency of 100 Hz, which means 100 samples per second, or one every 10 milliseconds. Each sample records the call stacks executing at that moment; aggregating the samples over the sampling period produces the CPU performance data.

We have already generated the CPU performance data file cpu.profile. Next, we will use the three methods mentioned above to analyze this performance file and optimize performance.

Method 1: Analyzing the Sampling Graph

The most intuitive way to analyze performance is through graphical representation. Therefore, we first need to generate a sampling graph, which involves two steps.

Step 1, make sure that graphviz is installed on your system:

$ sudo yum -y install graphviz.x86_64

Step 2, generate the call graph by executing go tool pprof:

$ go tool pprof -svg cpu.profile > cpu.svg  # svg format
$ go tool pprof -pdf cpu.profile > cpu.pdf # pdf format
$ go tool pprof -png cpu.profile > cpu.png # png format

The above commands will generate the cpu.pdf, cpu.svg, and cpu.png files, which contain the function call relationships and other sampling data. The image below shows an example:

Image

This image consists of directed segments and rectangles. Let’s first look at the meaning of the directed segments.

The directed edges describe the function call relationships, while the rectangles contain the CPU sampling data. An edge points from the caller to the callee; in the graph, for example, the v1.(*userService).List function calls the fake.(*policies).List function.

The number 90ms next to the edge means that, during the sampling period, calls from v1.(*userService).List into fake.(*policies).List accounted for a total of 90ms. From these call relationships we can tell which functions a given function calls and how much time each of those calls consumes.

Here, let’s interpret the important information in the call relationships of the graph:

Image

Of the accumulated sampling time (140ms) of runtime.schedule, 10ms comes from direct calls by the runtime.goschedImpl function, and 70ms from direct calls by the runtime.park_m function. These figures tell us which functions call runtime.schedule and how much of its time each caller accounts for. For the same reason, the time attributed to runtime.goschedImpl calling runtime.schedule must be less than or equal to the accumulated sampling time of runtime.schedule.

Now let’s take a look at the sampling data in the rectangles. These rectangles generally contain three types of information:

  • Function name/method name: This information includes the package name, structure name, and function/method name, making it easy for us to locate the function/method. For example, fake.(*policies).List indicates the List method of the policies structure in the fake package.
  • Local sampling time and its proportion of the total sampling time: the local sampling time is the total time during which a sample point fell within this function itself.
  • Accumulated sampling time and its proportion of the total sampling time: the accumulated sampling time is the total time during which a sample point fell within this function or within any function it calls, directly or indirectly.

We can explain the concepts of local sampling time and accumulated sampling time using the OutDir function as shown in the image below:

Image

We can consider the total execution time of the entire function as the accumulated sampling time, which includes the time spent on the white part of the code and the time spent on function calls (indicated by the red part). The time spent on the white part of the code can be considered as the local sampling time.

Through the accumulated sampling time, we can determine the total execution time of a function: the larger it is, the more CPU time that function and its callees consume. Note, however, that a large value does not necessarily mean the function itself is problematic; the bottleneck may lie in functions it calls directly or indirectly. In that case, follow the call relationships downward to find the functions that consume the most CPU time.

If the local sampling time of a function is large, the function's own code is expensive (excluding time spent calling other functions). In that case, we should analyze the code of this function itself, rather than the code of the functions it calls directly or indirectly.

In the sampling graph, the larger a rectangle's area, the longer the function's cumulative sampling time. A function with a large rectangle therefore deserves careful analysis, because it may offer performance optimization opportunities.

Method 2: Analyze the Flame Graph

The sampling chart we discussed above may not be very intuitive for performance analysis. Here, we can generate flame graphs to visualize the performance bottlenecks. A flame graph is a tool invented by Brendan Gregg specifically for visualizing sampled stack traces as an intuitive image, named because the entire graph looks like a flickering flame.

The go tool pprof command provides the -http parameter, which allows us to view the sampling chart and flame graph through a web browser. Execute the following command:

$ go tool pprof -http="0.0.0.0:8081" v1.test cpu.profile

Then access http://x.x.x.x:8081/ (x.x.x.x is the IP address of the server where the go tool pprof command is executed), and various sampling view data will be displayed in the browser, as shown in the following figure:

Image

The above UI page provides different sampling data views:

  • Top: similar to the Linux top command; functions sorted from highest consumption to lowest.
  • Graph: the default view, showing the call relationships.
  • Flame Graph: the pprof flame graph.
  • Peek: similar to Top, but also shows each function's callers and callees.
  • Source: source code annotated with sample values, similar to the list command in interactive mode.
  • Disassemble: disassembly annotated with sample totals.

Next, let’s focus on analyzing the flame graph. Select Flame Graph (VIEW -> Flame Graph) in the UI, and the flame graph will be displayed, as shown in the following figure:

Image

The flame graph has the following features:

  • Each column represents a call stack, and each cell represents a function.
  • The y-axis shows the depth of the stack, arranged from top to bottom according to the call relationship. The bottom cell represents the function that was occupying the CPU at the time of sampling.
  • The call stacks are sorted alphabetically from left to right, and identical call stacks are merged. Therefore, the wider a cell is, the more likely the corresponding function is a bottleneck.
  • The color of the cells in the flame graph is randomly warm-toned, making it easy to distinguish between different call information.

When viewing the flame graph, the wider a cell is, the more likely there is a performance issue with the corresponding function. At this point, we can analyze the code of that function to find the problem.

Method 3: Use the go tool pprof interactive mode to view detailed data

We can execute the go tool pprof command to view the CPU performance data file:

$ go tool pprof v1.test cpu.profile
File: v1.test
Type: cpu
Time: Aug 17, 2021 at 2:17pm (CST)
Duration: 56.48s, Total samples = 440ms ( 0.78%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

go tool pprof outputs a lot of information:

  • File: The name of the binary executable file.
  • Type: The type of the sampling file, such as cpu, mem, etc.
  • Time: The time at which the sampling file was generated.
  • Duration: The program execution time. In the example above, the program ran for 56.48s, during which the collected samples add up to 440ms. Because the runtime samples all cores, on multi-core machines the total sample time can even exceed the wall-clock execution time.
  • (pprof): The command prompt, indicating that we are now inside the interactive command line of go tool pprof (go tool bundles several commands, such as cgo, doc, pprof, and trace).

After executing the go tool pprof command, we will enter an interactive shell. In this interactive shell, we can execute multiple commands. The most commonly used commands are three, as shown in the following table:

Image

In the interactive interface, we execute the top command to view the performance sample data:

(pprof) top
Showing nodes accounting for 350ms, 79.55% of 440ms total
Showing top 10 nodes out of 47
      flat  flat%   sum%        cum   cum%
     110ms 25.00% 25.00%      110ms 25.00%  runtime.futex
      70ms 15.91% 40.91%       90ms 20.45%  github.com/marmotedu/iam/internal/apiserver/store/fake.(*policies).List
      40ms  9.09% 50.00%       40ms  9.09%  runtime.epollwait
      40ms  9.09% 59.09%      180ms 40.91%  runtime.findrunnable
      30ms  6.82% 65.91%       30ms  6.82%  runtime.write1
      20ms  4.55% 70.45%       30ms  6.82%  runtime.notesleep
      10ms  2.27% 72.73%      100ms 22.73%  github.com/marmotedu/iam/internal/apiserver/service/v1.(*userService).List
      10ms  2.27% 75.00%       10ms  2.27%  runtime.checkTimers
      10ms  2.27% 77.27%       10ms  2.27%  runtime.doaddtimer
      10ms  2.27% 79.55%       10ms  2.27%  runtime.mallocgc

In the output above, each row represents information about a function. The most important command in the pprof program is the topN command, which is used to display the top N samples in the profile file. The top command will output multiple lines of information, each line representing a function’s sample data, by default sorted by flat%. The meaning of each column in the output is as follows:

  • flat: The total time that sample points fell within this function itself.
  • flat%: flat as a percentage of the total sampling time.
  • sum%: The running total of flat% for this row and all rows above it.
  • cum: The total time that sample points fell within this function and the functions it calls.
  • cum%: cum as a percentage of the total sampling time.
  • The last column is the function name.

The above information can tell us the execution time and ranking of function performance. Based on this information, we can determine which functions may have performance issues or which functions can be further optimized.

I would like to remind you that if you execute go tool pprof mem.profile, the meanings of the fields mentioned above are similar, except that this time it represents the size of memory allocations in bytes.

By default, the top command is sorted by flat%. When performing performance analysis, we need to first sort by cum. By looking at cum, we can intuitively see which function has the highest total time consumption. Then, we can refer to the local sampling time and call relationship of that function to determine whether it is the function itself that is causing the high time consumption or the functions it calls.

The output of top -cum is as follows:

(pprof) top20 -cum
Showing nodes accounting for 280ms, 63.64% of 440ms total
Showing top 20 nodes out of 47
          flat  flat%   sum%        cum   cum%
             0     0%     0%      320ms 72.73%  runtime.mcall
             0     0%     0%      320ms 72.73%  runtime.park_m
             0     0%     0%      280ms 63.64%  runtime.schedule
          40ms  9.09%  9.09%      180ms 40.91%  runtime.findrunnable
         110ms 25.00% 34.09%      110ms 25.00%  runtime.futex
          10ms  2.27% 36.36%      100ms 22.73%  github.com/marmotedu/iam/internal/apiserver/service/v1.(*userService).List
             0     0% 36.36%      100ms 22.73%  github.com/marmotedu/iam/internal/apiserver/service/v1.BenchmarkListUser
             0     0% 36.36%      100ms 22.73%  runtime.futexwakeup
             0     0% 36.36%      100ms 22.73%  runtime.notewakeup
             0     0% 36.36%      100ms 22.73%  runtime.resetspinning
             0     0% 36.36%      100ms 22.73%  runtime.startm
             0     0% 36.36%      100ms 22.73%  runtime.wakep
             0     0% 36.36%      100ms 22.73%  testing.(*B).launch
             0     0% 36.36%      100ms 22.73%  testing.(*B).runN
          70ms 15.91% 52.27%       90ms 20.45%  github.com/marmotedu/iam/internal/apiserver/store/fake.(*policies).List
          10ms  2.27% 54.55%       50ms 11.36%  runtime.netpoll
          40ms  9.09% 63.64%       40ms  9.09%  runtime.epollwait
             0     0% 63.64%       40ms  9.09%  runtime.modtimer
             0     0% 63.64%       40ms  9.09%  runtime.resetForSleep
             0     0% 63.64%       40ms  9.09%  runtime.resettimer (inline)

From the above output, we can see that the local sampling time percentages of v1.BenchmarkListUser, testing.(*B).launch, and testing.(*B).runN are all 0%, while their cumulative sampling time percentages are all relatively high, at 22.73%.

Although the local sampling time percentage is small, the cumulative sampling time percentage is high, indicating that these three functions spend most of their time in the functions they call, and consume almost no time themselves. We can see the call relationships of the functions in the sampling graph, as shown in the following image:

Image

From the sampling chart, we can see that the final v1.BenchmarkListUser called the v1.(*userService).List function. The v1.(*userService).List function is a function we wrote. The local sampling time percentage of this function is 2.27%, but the cumulative sampling time percentage is as high as 22.73%. This indicates that the v1.(*userService).List function called other functions and consumed a large amount of CPU time.

By observing the sampling chart further, it can be seen that the long runtime of v1.(*userService).List is due to the call to the fake.(*policies).List function. We can also use the list command to view the runtime performance inside the function:

Image

list userService.*List will list the runtime performance of the code inside the userService struct’s List method. From the above image, it can also be seen that u.store.Policies().List takes the most time. The local sampling time percentage of fake.(*policies).List is 15.91%, indicating that the fake.(*policies).List function itself may have bottlenecks. By reading the code of fake.(*policies).List, it can be found that this function is a database query function, and database queries may have delays. By continuing to look at the v1.(*userService).List code, we can find the following calling logic:

func (u *userService) ListWithBadPerformance(ctx context.Context, opts metav1.ListOptions) (*v1.UserList, error) {
    ...
    for _, user := range users.Items {
        policies, err := u.store.Policies().List(ctx, user.Name, metav1.ListOptions{})
        ...
    }
    ...
}

In the for loop, fake.(*policies).List is called once per iteration. Because each call carries database latency, the runtime of v1.(*userService).List accumulates with every user in the list.

Now that we have identified the problem, how do we optimize it? You can take advantage of the CPU’s multi-core feature and start multiple goroutines. This way, our query time is not accumulated serially, but depends on the slowest fake.(*policies).List call. The code for the optimized v1.(*userService).List function can be found in internal/apiserver/service/v1/user.go. Tested with the same performance test case, the results are as follows:

$ go test -benchtime=30s -benchmem -bench=".*" -cpuprofile cpu.profile -memprofile mem.profile
goos: linux
goarch: amd64
pkg: github.com/marmotedu/iam/internal/apiserver/service/v1
cpu: AMD EPYC Processor
BenchmarkListUser-8   	    8330	   4271131 ns/op	   26390 B/op	     484 allocs/op
PASS
ok  	github.com/marmotedu/iam/internal/apiserver/service/v1	36.179s

In the output above, ns/op is 4271131 ns/op. Compared with the first test result of 204523677 ns/op, the performance has improved by 97.91%.

Here, please note that for your reference, I renamed the original v1.(*userService).List function to v1.(*userService).ListWithBadPerformance.

Memory Performance Analysis #

During the runtime of a Go program, the Go runtime records heap memory allocations. The sampling is driven by the amount allocated rather than by time: roughly one sample is recorded per runtime.MemProfileRate bytes allocated (512KB by default), so any allocation site that allocates enough bytes will show up in the profile.

The method of memory performance analysis is similar to CPU performance analysis, so I won’t repeat it here. You can analyze it yourself using the generated memory performance data file mem.profile. Next, let me show you the effects before and after memory optimization. In the v1.(*userService).List function (located in the internal/apiserver/service/v1/user.go file), we have the following code:

infos := make([]*v1.User, 0)
for _, user := range users.Items {
    info, _ := m.Load(user.ID)
    infos = append(infos, info.(*v1.User))
}

At this point, we run the go test command to record the memory performance before optimization, so that we can compare against it later:

$ go test -benchmem -bench=".*" -cpuprofile cpu.profile -memprofile mem.profile
goos: linux
goarch: amd64
pkg: github.com/marmotedu/iam/internal/apiserver/service/v1
cpu: AMD EPYC Processor
BenchmarkListUser-8   	     278	   4284660 ns/op	   27101 B/op	     491 allocs/op
PASS
ok  	github.com/marmotedu/iam/internal/apiserver/service/v1	1.779s

The values for B/op and allocs/op are 27101 B/op and 491 allocs/op, respectively.

By analyzing the code, we found that we can optimize infos := make([]*v1.User, 0) to infos := make([]*v1.User, 0, len(users.Items)) to reduce the number of memory reallocations for the Go slice. The optimized code is as follows:

//infos := make([]*v1.User, 0)
infos := make([]*v1.User, 0, len(users.Items))
for _, user := range users.Items {
    info, _ := m.Load(user.ID)
    infos = append(infos, info.(*v1.User))
}

Let’s execute go test again to test the performance:

$ go test -benchmem -bench=".*" -cpuprofile cpu.profile -memprofile mem.profile
goos: linux
goarch: amd64
pkg: github.com/marmotedu/iam/internal/apiserver/service/v1
cpu: AMD EPYC Processor
BenchmarkListUser-8   	     276	   4318472 ns/op	   26457 B/op	     484 allocs/op
PASS
ok  	github.com/marmotedu/iam/internal/apiserver/service/v1	1.856s

The optimized values for B/op and allocs/op are 26457 B/op and 484 allocs/op. Compared to the initial values of 27101 B/op and 491 allocs/op, both the number of allocations and the total bytes allocated per operation are reduced.

We can use the go tool pprof command to view the memory performance data file:

$ go tool pprof v1.test mem.profile
File: v1.test
Type: alloc_space
Time: Aug 17, 2021 at 8:33pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

This command will enter an interactive mode, and you can use the top command in the interactive mode to view the performance sample data, for example:

(pprof) top
Showing nodes accounting for 10347.32kB, 95.28% of 10859.34kB total
Showing top 10 nodes out of 52
      flat  flat%   sum%        cum   cum%
3072.56kB 28.29% 28.29%  4096.64kB 37.72%  github.com/marmotedu/iam/internal/apiserver/service/v1.(*userService).List.func1
1762.94kB 16.23% 44.53%  1762.94kB 16.23%  runtime/pprof.StartCPUProfile
1024.52kB  9.43% 53.96%  1024.52kB  9.43%  go.uber.org/zap/buffer.NewPool.func1
1024.08kB  9.43% 63.39%  1024.08kB  9.43%  time.Sleep
 902.59kB  8.31% 71.70%   902.59kB  8.31%  compress/flate.NewWriter
 512.20kB  4.72% 76.42%  1536.72kB 14.15%  github.com/marmotedu/iam/internal/apiserver/service/v1.(*userService).List
 512.19kB  4.72% 81.14%   512.19kB  4.72%  runtime.malg
 512.12kB  4.72% 85.85%   512.12kB  4.72%  regexp.makeOnePass
 512.09kB  4.72% 90.57%   512.09kB  4.72%  github.com/marmotedu/iam/internal/apiserver/store/fake.FakeUsers
 512.04kB  4.72% 95.28%   512.04kB  4.72%  runtime/pprof.allFrames

The meanings of the fields in the above memory performance data are as follows:

  • flat: The total memory allocated by this function itself, as recorded at the sample points.
  • flat%: flat as a percentage of the total sampled memory.
  • sum%: The running total of flat% for this row and all rows above it.
  • cum: The total memory allocated by this function and the functions it calls.
  • cum%: cum as a percentage of the total sampled memory.
  • The last column is the function name.

Summary #

When the performance of a Go project is low, we need to analyze the problematic code. The go tool pprof tool provided by Go allows us to analyze the performance of the code. We can analyze the performance of the code through two steps: generating performance data files and analyzing performance data files.

There are three ways in Go to generate performance data files: generating performance data files through the command line, generating performance data files through code, and generating performance data files through net/http/pprof.

After generating the performance data file, we can use the go tool pprof tool to analyze it. We can obtain performance data for both CPU and memory and identify performance bottlenecks through analysis. There are three ways to analyze performance data files: analyzing sample graphs, analyzing flame graphs, and using the go tool pprof interactive mode to view detailed data. Because flame graphs are intuitive and efficient, I recommend using them more often to analyze performance.

Exercises #

  1. Consider why the calling time of the function runtime.goschedImpl must be less than or equal to the accumulated sampling time of the function runtime.schedule.
  2. In your development of Go projects, what are some good performance analysis ideas and methods? Please feel free to share them in the comments.

Feel free to communicate and discuss with me in the comments section. See you in the next lecture.