00 Preface Don't Let Linux Performance Issues Become Your Stumbling Block

Hello, I’m Ni Pengfei, a veteran in cloud computing and a maintainer of the Kubernetes project. I am primarily responsible for the implementation of the open-source container orchestration system Kubernetes in Azure.

I have been working in the field of cloud computing for a long time. My interest in server performance can be traced back to when I first started working. Why did I start exploring performance issues so early? It actually stems from an unforgettable “accident”.

At that time, I was working at Shanda Cloud. After working late into the night, I had just lain down to rest when I suddenly received a flood of alerts. I hurriedly got up and logged into the servers, only to find that the CPU usage of some system processes had reached 100%.

I was completely at a loss. I could see the symptoms, but I had no idea where to start troubleshooting or how to resolve the issue. In the end, I never came up with a solution, and that release became a painful memory for me.

After that, I began reading all kinds of relevant books, from operating system principles to the Linux kernel and even hardware drivers. However, even after studying all of that material, I still couldn’t quickly solve similar performance problems.

So I started searching online and seeking advice from technical experts in the company. I learned a lot of performance optimization strategies and methods, and along the way I tried numerous Linux performance tools. Through continuous practice and reflection, I finally learned how to correlate observed performance problems with system principles, and especially how to connect the different levels of the system, from applications, library functions, and system calls down to the kernel and hardware.

This period of learning could be regarded as my “dark” experience. I believe I am not alone, and that many people have gone through similar frustrations. For example:

During peak traffic hours, the server’s CPU usage is alarmingly high. You log into Linux and run the top command, but you have no idea how to determine whether the cause is insufficient CPU resources or a problem in the concurrent parts of the program.
There are no memory-consuming programs running on the system, but after running the free command, you find out that there is hardly any memory left. So, what is consuming the memory and why?
Early in the morning, you receive a Zabbix alert, and you find that the I/O wait of a database host storing monitoring data is high. What should you do in such a situation?

These problems or scenarios are surely encountered by most people to some extent.
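
To make the second scenario concrete, here is a minimal Python sketch of one way to start answering “what is consuming the memory?”. It assumes a Linux host that exposes /proc/meminfo with the standard MemTotal, MemFree, MemAvailable, Buffers, Cached, and SReclaimable fields. The function names parse_meminfo, summarize, and report are hypothetical and chosen only for illustration, and the cache figure is a rough approximation rather than an exact match for what the free command prints.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def summarize(info):
    """Return free, cache, and available memory as percentages of MemTotal."""
    total = info["MemTotal"]
    free = info["MemFree"]
    # Rough approximation of reclaimable cache; not an exact accounting.
    cache = (info.get("Buffers", 0) + info.get("Cached", 0)
             + info.get("SReclaimable", 0))
    # MemAvailable is missing on older kernels; fall back to MemFree.
    available = info.get("MemAvailable", free)
    return {
        "free_pct": 100.0 * free / total,
        "cache_pct": 100.0 * cache / total,
        "available_pct": 100.0 * available / total,
    }


def report(path="/proc/meminfo"):
    """Print the summary for the current host (Linux only)."""
    with open(path) as f:
        summary = summarize(parse_meminfo(f.read()))
    for key, value in summary.items():
        print(f"{key}: {value:.1f}%")
```

Calling report() on a Linux machine prints the three percentages. If free_pct is low while cache_pct and available_pct are high, most of the “missing” memory is probably being used as buffers and cache that the kernel can reclaim, which is usually not a problem in itself.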

In reality, performance optimization has always been a “taboo” for most software engineers, and even experienced engineers who have worked for many years cannot accurately analyze many performance issues in production environments.

Why is performance optimization so difficult? I think it’s mainly because performance optimization is a systems engineering problem, in which a slight change in one place can affect the entire system. It spans everything from program design, algorithm analysis, and programming languages to underlying infrastructure such as the operating system, storage, and network. Any component can have issues, and it’s quite possible for several components to have issues at the same time.

There is no doubt that performance optimization is one of the most challenging tasks in software systems. Seen from a different perspective, it is also one of the tasks that best tests your all-around abilities. If you can master every key aspect of performance optimization, I can confidently say that you are already an outstanding software engineer.

So, how can you acquire this skill? You can do what I described earlier: spend a lot of time and effort digging into it, working your way from theory to real-world practice. That approach is feasible, but it may also involve many detours. You might work through thick books and finally grasp the most difficult parts of the underlying system, yet still have no idea how to tackle real-world tasks because you lack practical experience. In fact, for most of us, the best way to learn is to start from concrete questions rather than from those thick theoretical books, which can easily crush our confidence.

I believe that learning should focus on key points. As long as you understand the basic principles of a few system components and how they work together, grasp the basic performance indicators and tools, and learn the common performance optimization techniques used in practical work, you will be able to accurately analyze and optimize most performance issues. With that foundation in place, you can go back and read the classic books on operating systems and related topics, and your learning will be far more efficient.

Therefore, in this column, I will use a case-driven approach to explain the basic indicators, tools, and corresponding observation, analysis, and tuning methods for Linux performance.

Specifically, I will divide the column into 5 modules. In the first 4 modules, I will start from the perspective of resource usage and guide you through the various performance issues that Linux resources may encounter, covering CPU performance, disk I/O performance, memory performance, and network performance. Each module is further divided into four chapters that progress from basic to advanced.

Basic Chapter: introduces the essential principles, performance indicators, and performance tools of Linux. For example, how to understand average load, how to understand context switching, and how Linux memory works.
Case Chapter: using simulated cases, I will show you how experts observe, locate, analyze, and optimize performance issues when they run into resource bottlenecks.
Routine Chapter: after you have understood the basics and worked through the simulated cases, I will help you summarize the overall troubleshooting approach, that is, the general steps for investigating performance issues. When you encounter problems in the future, you can follow this path.
Q&A Chapter: I expect that you will have many questions after completing each module. In the Q&A chapter, I will systematically answer the most frequently asked ones.

The 5th module, a comprehensive practical module, will recreate real work scenarios and guide you through “the advanced battlefield” step by step, so that you can integrate everything you have learned and apply it immediately in your work.

Throughout this column, I will try to make the content easy to understand, help you identify key points, and organize the knowledge structure. Through case analysis and routine summary, I will enable you to learn more thoroughly and apply the knowledge more proficiently.

The formal classes will start tomorrow. Before we begin, I’d like to share a quote that I particularly agree with, which was said by He Jiong: “If you want to gain something, you must learn to put in effort and persist. If you truly find it difficult, then give up. If you give up, don’t complain. That’s life. The world is balanced, and everyone determines their own life through their own efforts.”

I share it for no other reason than that I hope you can stick with me until the very end, until the last article. Along the way, if there is something you don’t understand, think it over a few more times on your own. If you still don’t understand, you can ask me in the comment section, and I will reply in a timely manner. You should also take notes on the knowledge points that need to be summarized and refined, and you can write down your own experiences and record your analysis steps and thoughts.

Finally, you can set a goal for yourself in the comment section. Even if you only use it to check in on your learning progress, I believe it will have an effect. After 3 months, we will review together.

In short, let’s work together to make “Linux Performance Optimization” one of your major skills!
