36 Discussing Various Debugging Techniques in OpenResty #

Hello, I’m Wen Ming.

In the OpenResty community, a question developers often ask is: how do you debug in OpenResty? As far as I know, there are some debugging tools available for OpenResty, including VS Code plugins, but they are not widely used. Even the author agentzh and the several contributors I know all rely on basic methods such as ngx.log and ngx.say for debugging.
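
To make that concrete, here is a minimal sketch of this logging style (the variable names are made up for illustration): during development you can print intermediate values straight into the response with ngx.say, and on a running service you can write them to error.log with ngx.log.

-- a minimal sketch; this runs inside an OpenResty handler such as content_by_lua_block
local price  = 100
local count  = 3
local amount = price * count

-- print the intermediate value directly into the response body (handy in development)
ngx.say("amount: ", amount)

-- or record it in error.log at a visible level, so it can be checked on a live service
ngx.log(ngx.ERR, "amount is ", amount)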

Obviously, this is not very user-friendly for most beginners. Does it mean that the core maintainers of OpenResty only rely on logging when encountering difficult problems?

Of course not. In the OpenResty world, SystemTap and flame graphs are the standard tools for tackling tricky problems and performance issues. If you ask this kind of question on the mailing list or in an issue, the project maintainers will almost certainly ask you to upload a flame graph, since they prefer a visual representation to a textual description.

In the next two lessons, I will talk to you about debugging and the toolset specifically created for debugging in OpenResty. Today, let’s first take a look at the various debugging methods available.

Breakpoints and Logging #

For a long time in my work, I relied on my editor's advanced debugging features to trace through programs, and that felt natural. For problems that can be reproduced in a testing environment, no matter how complex, I am confident I can find the root cause, because the bug can be replicated over and over again: set breakpoints, add logging, and the root cause will gradually surface. All you need is patience.

From this perspective, fixing bugs that can be reproduced reliably in a testing environment is really just grunt work, and the majority of the bugs I have fixed in my career fall into this category.

However, note the two prerequisites here: a testing environment and stable reproduction. Reality is rarely that ideal, so is there a way to debug bugs that appear only in production?

Here I recommend a tool called Mozilla rr. You can think of it as a tape recorder: it records a program's behavior and then replays it over and over. Simply put, whether in production or in testing, as long as you can record the "evidence" of the bug, you can take your time analyzing it later, like evidence presented in court.

Binary Search and Annotations #

However, for large projects or systems with a wide surface area, where a bug may come from any of several services, or may even be an issue with the SQL queries hitting the database, you often cannot tell which part of the system a bug lives in even if it reproduces consistently. In such cases, a recording tool like Mozilla rr is no longer much help.

At this point you may recall the classic idea of binary search. First, comment out half of the logic in the code. If the problem persists, the bug is in the half that is still running; then comment out half of the remaining logic and repeat. After a few rounds, the problem is narrowed down to a manageable scope.

Although this approach may sound a bit silly, it is indeed effective in many scenarios. Of course, with the advancement of technology and the increasing complexity of systems, we now recommend using standards like OpenTracing for distributed tracing.

With OpenTracing you can instrument various parts of a system and report call chains made up of multiple spans, along with the instrumentation data, to a collector under a single trace ID. The data can then be analyzed and presented graphically, which surfaces many hidden problems; the historical data is also kept, so it is easy to go back and compare at any time.
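
As a rough sketch of what such instrumentation looks like in Lua (the tracer object, its initialization, and do_query are hypothetical here; any OpenTracing-compatible tracer exposes start_span, set_tag, and finish in some form):

-- hypothetical: `tracer` is an already-initialized OpenTracing-compatible tracer
local span = tracer:start_span("mysql_query")   -- one span per unit of work
span:set_tag("db.statement", "SELECT * FROM users WHERE id = 1")

local ok, err = do_query()                      -- hypothetical business call

if not ok then
    span:set_tag("error", true)                 -- mark the span as failed
end

span:finish()  -- the finished span is reported under the request's trace ID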

In addition, if your system is complex, such as in a microservices environment, Zipkin and Apache SkyWalking are both good choices.

Dynamic Debugging #

The debugging methods above can solve most problems. However, if you hit an issue that appears only occasionally in production, tracking it down by adding logs and instrumentation points will take a considerable amount of time.

I once ran into such a bug. Many years ago, I was responsible for a system that, around 1 a.m. every day, would exhaust its database resources and bring the whole system down. During the day we reviewed the scheduled tasks in the code, and at night the team waited at the office for the bug to recur and then checked the state of their respective submodules. We did not find the root cause until the third night.

My experience is similar to the story behind DTrace, which was created by several Solaris system engineers. They, too, had spent days and nights chasing a strange production problem, only to discover it was caused by a misconfiguration. Unlike me, though, they decided to eliminate this class of problem altogether, so they invented DTrace, a tool built specifically for dynamic debugging.

Dynamic debugging, also known as live debugging, differs from static debugging tools such as GDB in that it can debug a service while it is online; for the program being debugged, the whole process is transparent and non-intrusive, with no code changes or restarts required. To use an analogy, dynamic debugging is like an X-ray: it can examine the patient without blood tests or an endoscopy, and the patient barely notices.

DTrace is the earliest dynamic tracing framework, and under its influence similar dynamic debugging tools gradually appeared on other systems. For example, engineers at Red Hat created SystemTap for the Linux platform, which is the protagonist of the rest of this lesson.

SystemTap #

SystemTap has its own DSL, a small language used to define probes. Before going into more detail, let's install SystemTap so we don't stay purely at the level of abstract concepts. You can install it with the system's package manager:

sudo apt install systemtap

Now let's take a look at what a hello world program written in SystemTap looks like:

# cat hello-world.stp
probe begin
{
  print("hello world!")
  exit()
}

Pretty simple, right? However, you need to use sudo privileges to run it:

sudo stap hello-world.stp

It prints the expected hello world!. In most cases we don't need to write our own stap scripts for analysis, because OpenResty already ships many ready-made stap scripts for common analyses; I will introduce them in the next lesson. For today, a basic feel for stap scripts is enough.

After playing with it for a bit, let's get back to the concepts. SystemTap works by translating the stap script above into C, using the system C compiler to build a kernel module, and then loading that module, which activates all of its probe events by hooking into the kernel.

For example, probe in the code above defines a probe. The begin event fires at the very start of a SystemTap session, and its counterpart is the end event. So the previous hello world program can also be written in the following way:

probe begin
{
  print("hello ")
  exit()
}

probe end
{
  print("world!")
}

Here I have only scratched the surface of SystemTap. In fact, Frank Ch. Eigler, the author of SystemTap, has written an e-book, “Systemtap tutorial”, that covers it in detail. If you want to study SystemTap further, I suggest starting from this book; it is the best learning path.

Other Dynamic Tracing Frameworks #

Of course, for kernel and performance-analysis engineers, SystemTap alone is not enough. First, SystemTap is not part of the system kernel by default. Second, the way it works means it starts up relatively slowly, and it may affect the normal operation of the system.

eBPF (extended BPF) is a feature added to the Linux kernel in recent years. Compared with SystemTap, eBPF is supported directly by the kernel, will not hang or crash the system, and starts quickly. In addition, it does not rely on a DSL but uses C syntax directly, which greatly lowers the barrier to entry.

Besides the open-source options, Intel's VTune is another powerful tool. Its intuitive interface and data presentation let you analyze performance bottlenecks without writing any code.

Flame Graph #

Finally, let's come back to the flame graph mentioned in earlier lessons. As noted before, the data produced by tools such as perf and SystemTap can be rendered as a flame graph, which is far easier to read. The image below is an example of a flame graph:

In a flame graph, the colors and shades of the blocks carry no meaning; they only help distinguish neighboring blocks. A flame graph aggregates the data from all samples, so the meaningful dimensions are the width of the blocks and the height of the stacks.

For an on-CPU flame graph, the width of a block is the percentage of CPU time spent in that function: the wider the block, the bigger the performance impact, and a wide, flat-topped "plateau" marks the performance bottleneck. The height of a stack reflects call depth: the topmost box is the function currently running and the boxes beneath it are its callers, so each function below is the parent of the one above, and the taller the flame, the deeper the call chain.
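
To make the "wide block" idea concrete, here is a contrived Lua example (the function name is invented): if a handler spends most of its CPU time in a loop like this, build_body would show up in an on-CPU flame graph as a wide, flat-topped frame sitting above its caller.

-- contrived hot spot: repeated string concatenation copies the whole string
-- each time, so this function burns CPU and would appear as a wide frame
local function build_body(n)
    local body = ""
    for i = 1, n do
        body = body .. "line " .. i .. "\n"
    end
    return body
end

-- the frame for build_body would sit directly above this caller in the graph
ngx.say(build_body(50000))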

To help you gain a deeper understanding of this powerful tool, I will demonstrate in upcoming video lessons how to use a real code example to identify and address performance bottlenecks with a flame graph.

Conclusion #

Note that even non-intrusive dynamic tracing is not perfect: it can only inspect a single process, and we generally enable it only briefly, relying on the data sampled during that window. So if you need to trace across multiple services or monitor over long periods, you still need distributed tracing technologies such as OpenTracing.

Which debugging tools and techniques do you use in your own work? Feel free to leave a comment and discuss with me, and to share this article with your friends so we can learn and improve together.