
36 How to Evaluate System Network Performance #

Hello, I’m Ni Pengfei.

In the previous section, we reviewed the classic C10K and C1000K problems. As a brief recap, C10K refers to the challenge of handling 10,000 requests (with 10,000 concurrent connections) on a single machine, while C1000K is about supporting 1 million requests (with 1 million concurrent connections) on a single machine.

Optimizing the I/O model is the key to solving the C10K problem. epoll, introduced in Linux 2.6, addressed the C10K challenge so well that it is still widely used today: many high-performance network solutions remain based on epoll.

Naturally, as internet technology becomes more prevalent, higher performance demands arise. From C10K to C100K, we only need to increase the physical resources of the system to meet the requirements. However, going from C100K to C1000K, simply adding physical resources is no longer sufficient.

At this point, it is necessary to carry out unified optimization of the system’s software and hardware, from interrupt handling in hardware to the number of file descriptors in the network protocol stack, connection state tracking, and caching queues. Additionally, the entire network chain, including the working model of the application program, needs to be thoroughly optimized.

Furthermore, achieving C10M cannot be solved by simply adding physical resources or tuning the kernel and applications. The lengthy network protocol stack in the kernel becomes the biggest burden.

  • To address this, it is necessary to use XDP to process network packets before they reach the kernel protocol stack.

  • Alternatively, one can use DPDK to directly bypass the network protocol stack and handle packets in user space through polling.

Among these, DPDK is currently the most popular high-performance network solution, but it requires the use of network cards that support DPDK.

Of course, in most scenarios, we do not need to handle 10 million concurrent requests on a single machine. Instead, a simpler and more scalable approach is to distribute requests to multiple servers for parallel processing.

However, in such cases, it is crucial to evaluate the network performance of the system in order to assess its processing capabilities and provide benchmark data for capacity planning.

So, how do we evaluate the network performance? Today, I’ll walk you through this question.

Performance Indicators Review #

Before evaluating network performance, let’s review the indicators used to measure it. In the Linux Network Basics course, we mentioned that bandwidth, throughput, latency, and PPS (Packets Per Second) are the most commonly used network performance indicators. Do you remember their specific meanings? Take a moment to think about it before continuing.

Firstly, bandwidth represents the maximum transmission rate of a link, and it is measured in b/s (bits per second). When choosing a network card for a server, bandwidth is the most important reference indicator. Commonly used bandwidths include 1000M, 10G, 40G, and 100G.

Secondly, throughput represents the maximum data transfer rate without any packet loss, and it is usually measured in b/s (bits per second) or B/s (bytes per second). Throughput is limited by the bandwidth, and throughput/bandwidth indicates the utilization rate of the network link.
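The utilization figure mentioned above is simply throughput divided by bandwidth. A quick illustration with made-up numbers:

```shell
# Link utilization = throughput / bandwidth
# e.g. 860 Mbit/s measured on a 1000 Mbit/s link (illustrative figures)
awk 'BEGIN { printf "%.0f%%\n", 860 / 1000 * 100 }'
```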

Thirdly, latency represents the time delay from sending a network request until receiving the remote response. This indicator may have different meanings in different scenarios. It can represent the time required to establish a connection (such as TCP handshake delay) or the time required for a round trip of a data packet (such as RTT, Round Trip Time).

Lastly, PPS is the abbreviation for Packets Per Second, which represents the transmission rate in terms of network packets. PPS is usually used to evaluate the forwarding capability of a network, and the forwarding performance based on Linux servers is easily affected by the size of network packets (switches, on the other hand, are usually not heavily affected and can forward packets linearly).

Among these four indicators, bandwidth is directly associated with the physical network card configuration. Generally, once a network card is determined, the bandwidth is also determined (although the actual bandwidth will be limited by the smallest module in the entire network link).

Additionally, you may have heard of “network bandwidth testing” in many places. However, what is actually tested is network throughput, not bandwidth. The network throughput of a Linux server is generally smaller than the bandwidth, while for dedicated network devices such as switches, the throughput is generally close to the bandwidth.

Finally, PPS is the network transmission rate measured in terms of network packets. It is usually used in scenarios that require a large amount of forwarding. When it comes to TCP or web services, metrics such as concurrent connection numbers and queries per second (QPS) are more commonly used. These metrics better reflect the performance of actual applications.

Network Benchmark Testing #

After familiarizing yourself with the performance indicators of networks, let’s take a look at how to determine the benchmark values of these indicators through performance testing.

You can start by thinking about a question. We already know that the Linux network is based on the TCP/IP protocol stack, and clearly, different protocol layers behave differently. Hence, before testing, you should clarify which layer of the protocol stack the network performance you want to evaluate belongs to. In other words, on which layer of the protocol stack is your application based?

According to the principles of the TCP/IP protocol stack that we learned earlier, this question should not be difficult to answer. For example:

  • Web applications based on HTTP or HTTPS obviously belong to the application layer and require us to test the performance of HTTP/HTTPS.

  • For most game servers, in order to support a larger number of concurrent online users, they usually interact with clients based on TCP or UDP. In this case, we need to test the performance of TCP/UDP.

  • Of course, there are also scenarios where Linux is used as a software switch or router. In this case, you pay more attention to the processing capacity of network packets (i.e., PPS) and focus on the forwarding performance of the network layer.

Next, I will take you from the bottom up to understand the network performance testing methods of different protocol layers. However, please note that the lower-layer protocols are the basis of the upper-layer network protocols. Naturally, the performance of the lower-layer protocols determines the performance of the higher-layer network.

Please note that all the testing methods below require two Linux virtual machines: one acts as the target machine under test, running the network service, while the other acts as the client, running the testing tools.

Performance Testing for Each Protocol Layer #

Forwarding Performance #

Let’s start by looking at the network interface layer and the network layer, which are mainly responsible for packet encapsulation, addressing, routing, and sending and receiving. In these two network protocol layers, the most important performance metric is the number of packets processed per second (PPS). The processing capability of 64B small packets is particularly worth our attention. So, how do we test the packet processing capability of the network?

Speaking of network packet-related testing, you may find it unfamiliar. However, in the CPU performance section at the beginning of the column, we have come across a related tool, which is hping3 in the soft interrupt case.

In that scenario, hping3 was used as a tool for SYN attacks. In fact, hping3 has more uses as a performance tool for testing network packet processing capabilities.

Today, let me introduce another more commonly used tool, pktgen, a high-performance network testing tool built into the Linux kernel. Pktgen supports a wealth of custom options, allowing you to construct the required network packets according to your actual needs, thus accurately testing the performance of the target server.

However, in a Linux system, you cannot directly find the pktgen command. This is because pktgen runs as a kernel thread and requires you to load the pktgen kernel module and then interact through the /proc file system. The following are the two kernel threads started by pktgen and the interaction files in the /proc file system:

$ modprobe pktgen
$ ps -ef | grep pktgen | grep -v grep
root     26384     2  0 06:17 ?        00:00:00 [kpktgend_0]
root     26385     2  0 06:17 ?        00:00:00 [kpktgend_1]
$ ls /proc/net/pktgen/
kpktgend_0  kpktgend_1  pgctrl

pktgen starts a kernel thread on each CPU, and you can interact with these threads through the files of the same name under /proc/net/pktgen. The pgctrl file is used to start and stop tests.

If the modprobe command fails, your kernel was built without the CONFIG_NET_PKTGEN option. In that case, you need to enable the pktgen kernel module (i.e., CONFIG_NET_PKTGEN=m) and recompile the kernel before you can use it.
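Before recompiling, you can first check whether your running kernel already has the option enabled. The config file path below is the usual Debian/Ubuntu/CentOS location and may differ on your distribution:

```shell
# Check whether the running kernel was built with pktgen support;
# prints e.g. CONFIG_NET_PKTGEN=m if the module is available
grep CONFIG_NET_PKTGEN "/boot/config-$(uname -r)" 2>/dev/null || echo "config file not found"
```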

When using pktgen to test network performance, you need to configure pktgen options for each kernel thread kpktgend_X and the test network card before starting the test through pgctrl.

Taking packet transmission testing as an example, suppose the sending machine uses the network card eth0, and the IP address of the target machine is 192.168.0.30, and the MAC address is 11:11:11:11:11:11.

Next is an example of a packet transmission test.

# Define a utility function to conveniently configure various test options later
function pgset() {
    local result
    echo $1 > $PGDEV

    result=`cat $PGDEV | fgrep "Result: OK:"`
    if [ "$result" = "" ]; then
         cat $PGDEV | fgrep Result:
    fi
}

# Bind eth0 network card to thread 0
PGDEV=/proc/net/pktgen/kpktgend_0
pgset "rem_device_all"   # Unbind network card
pgset "add_device eth0"  # Add eth0 network card

# Configure test options for eth0 network card
PGDEV=/proc/net/pktgen/eth0
pgset "count 1000000"    # Total number of packets to send
pgset "delay 5000"       # Delay between different packets (in nanoseconds)
pgset "clone_skb 0"      # Number of times to clone each SKB (0 means allocate a fresh SKB for every packet)
pgset "pkt_size 64"      # Packet size
pgset "dst 192.168.0.30" # Destination IP
pgset "dst_mac 11:11:11:11:11:11"  # Destination MAC

# Start the test
PGDEV=/proc/net/pktgen/pgctrl
pgset "start"

Wait a moment for the test to complete; the results can then be read from the /proc file system. We can view the test report with the following command:

$ cat /proc/net/pktgen/eth0
Params: count 1000000  min_pkt_size: 64  max_pkt_size: 64
     frags: 0  delay: 0  clone_skb: 0  ifname: eth0
     flows: 0 flowlen: 0
...
Current:
    pkts-sofar: 1000000  errors: 0
    started: 1534853256071us  stopped: 1534861576098us idle: 70673us
...
Result: OK: 8320027(c8249354+d70673) usec, 1000000 (64byte,0frags)
    120191pps 61Mb/sec (61537792bps) errors: 0

As you can see, the test report is mainly divided into three parts:

  • The first part, Params, contains the test options;

  • The second part, Current, shows the progress of the test: "pkts-sofar" indicates that 1 million packets have been sent, meaning the test is complete.

  • The third part, Result, contains the test results: the time it took, the number of network packets and fragments, PPS, throughput, and errors.

According to the above results, we can see that the PPS is 120,191 and the throughput is 61 Mbps. No errors occurred. So, is a PPS of 120,000 good or not?
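As a quick sanity check, the pps and bps figures in the report are consistent with each other: multiplying the packet rate by the packet size in bits reproduces the reported 61537792 bps.

```shell
# packet rate (pps) x packet size (64 bytes) x 8 bits/byte
# should match the bps value printed by pktgen
echo $((120191 * 64 * 8))
```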

As a comparison, let's calculate the PPS of a gigabit switch. A switch can achieve wire speed (error-free forwarding at full load), and its PPS is the line rate divided by the on-the-wire size of a minimum Ethernet frame: 1,000,000,000 bps / ((64 + 20) bytes × 8 bits/byte) ≈ 1.488 million, or roughly 1.5 million pps. (The extra 20 bytes are the frame preamble and the inter-frame gap.)
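The calculation above can be reproduced directly in the shell; integer division truncates the result to 1,488,095 pps:

```shell
# Wire-speed pps of a gigabit link: line rate in bits per second
# divided by the on-the-wire frame size (64-byte minimum frame plus
# 20 bytes of preamble and inter-frame gap), in bits
echo $((1000000000 / ((64 + 20) * 8)))
```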

You see, even the PPS of a gigabit switch can reach 1.5 million PPS, much higher than the 120,000 PPS we obtained from the test. So, you don't need to worry about this value. Nowadays, multi-core servers and 10 Gigabit network cards are very common, and with a little optimization, you can achieve millions of PPS. Furthermore, if you use DPDK or XDP, which were mentioned in the previous lesson, you can reach tens of millions of PPS.

TCP/UDP Performance #

Having learned how to test PPS, let's now look at the methods for testing TCP and UDP performance. When it comes to testing TCP and UDP, you're probably already familiar with it and may even immediately think of corresponding testing tools, such as iperf or netperf.

Especially in this era of cloud computing, when you first receive a batch of virtual machines, the first thing you should do is use iperf to test whether the network performance meets expectations.

iperf and netperf are the most commonly used network performance testing tools, primarily for measuring TCP and UDP throughput. Both work by having a client communicate with a server and measuring the average throughput over a period of time.

Let's take iperf as an example and look at the method for testing TCP performance. Currently, the latest version of iperf is iperf3. You can install it by running the following command:

# Ubuntu
$ apt-get install iperf3
# CentOS
$ yum install iperf3

Then, start the iperf server on the target machine:

# -s indicates starting the server, -i indicates the interval for reporting, -p indicates the listening port
$ iperf3 -s -i 1 -p 10000

Next, run the iperf client on another machine and initiate the test:

# -c indicates starting the client, 192.168.0.30 is the IP address of the target server
# -b indicates the target bandwidth (in bits/s)
# -t indicates the test duration
# -P indicates the number of parallel connections, -p indicates the listening port of the target server
$ iperf3 -c 192.168.0.30 -b 1G -t 15 -P 2 -p 10000

After waiting a moment (15 seconds) for the test to finish, go back to the target server and view the iperf report:

[ ID] Interval           Transfer     Bandwidth
...
[SUM]   0.00-15.04  sec  0.00 Bytes  0.00 bits/sec                  sender
[SUM]   0.00-15.04  sec  1.51 GBytes   860 Mbits/sec                  receiver

The final SUM line is the summary of the test results, including the test duration, data transfer, and bandwidth, among others. This section is further divided into sender and receiver lines based on sending and receiving.

From the test results, you can see that the TCP bandwidth (throughput) received by this machine is 860 Mbps, which is somewhat lower than the target of 1 Gbps.
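The receiver-side numbers are self-consistent: sustaining 860 Mbit/s over the 15.04-second interval works out to the 1.51 GBytes reported in the transfer column.

```shell
# 860 Mbit/s for 15.04 s, converted to bytes and then to GiB,
# which is the unit iperf3 uses for the Transfer column
awk 'BEGIN { bytes = 860e6 * 15.04 / 8; printf "%.2f GiB\n", bytes / (1024 * 1024 * 1024) }'
```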

HTTP Performance #

Moving up from the transport layer to the application layer, some applications directly build services based on TCP or UDP. Of course, there are also many applications that build services based on application-layer protocols. HTTP is the most commonly used application-layer protocol. For example, widely used web servers such as Apache and Nginx are all based on HTTP.

To test the performance of HTTP, there are various tools available, such as ab and webbench, which are commonly used HTTP stress testing tools. Among them, ab is an HTTP benchmarking tool that comes with Apache, mainly used to test metrics such as the number of requests per second, request latency, throughput, and request latency distribution of an HTTP service.

To install the ab tool, run the following command:

# Ubuntu
$ apt-get install -y apache2-utils
# CentOS
$ yum install -y httpd-tools

Next, on the target machine, use Docker to start an Nginx service and then use ab to test its performance. First, run the following command on the target machine:

$ docker run -p 80:80 -itd nginx

On another machine, run the ab command to test the performance of Nginx:

# -c represents the number of concurrent requests (1000) and -n represents the total number of requests (10000)
$ ab -c 1000 -n 10000 http://192.168.0.30/
...
Server Software:        nginx/1.15.8
Server Hostname:        192.168.0.30
Server Port:            80

...

Requests per second:    1078.54 [#/sec] (mean)
Time per request:       927.183 [ms] (mean)
Time per request:       0.927 [ms] (mean, across all concurrent requests)
Transfer rate:          890.00 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   27 152.1      1    1038
Processing:     9  207 843.0     22    9242
Waiting:        8  207 843.0     22    9242
Total:         15  233 857.7     23    9268

Percentage of the requests served within a certain time (ms)
  50%     23
  66%     24
  75%     24
  80%     26
  90%    274
  95%   1195
  98%   2335
  99%   4663
 100%   9268 (longest request)

As you can see, the test results of ab are divided into three sections: request summary, connection time summary, and request latency summary. Let’s look at the above result in more detail.

In the request summary section, you can see:

  • Requests per second: 1078.54.
  • Time per request: The first line (927 ms) is the average latency per request as experienced by a client, including the time spent waiting among the 1000 concurrent requests; the second line (0.927 ms) is that value divided by the concurrency level, i.e., the average time spent per request across all concurrent requests.
  • Transfer rate: 890 KB/s, i.e., the throughput.
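These figures are tied together by the concurrency level: dividing 1000 ms by the 0.927183 ms across-all-requests latency reproduces the requests-per-second value from the summary above.

```shell
# requests per second = 1 / (time per request across all concurrent
# requests, in seconds); values taken from the sample ab output above
awk 'BEGIN { printf "%.2f\n", 1000 / 0.927183 }'
```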

The connection time summary section shows the various time metrics for establishing connections, processing requests, waiting, and overall processing, including the minimum, maximum, average, and median processing times.

The request latency summary section provides the percentage of requests processed within different time periods. For example, 90% of the requests can be completed within 274 ms.

Application Load Performance #

When using testing tools like iperf or ab to obtain performance data for TCP, HTTP, and so on, can those numbers accurately represent the actual performance of the application? I'm afraid not. For example, suppose your application is based on HTTP and serves web pages to end users. You can use ab to measure the raw performance of a given page, but the result may not match what real users experience: user requests often carry various payloads, and those payloads affect the internal processing logic of the web application and, in turn, its overall performance.

Therefore, to obtain the actual performance of the application, the testing tool itself must be able to simulate the users' request payloads. Tools like iperf and ab are powerless here. Fortunately, we still have options such as wrk, tcpcopy, JMeter, or LoadRunner.

Taking wrk as an example, it is an HTTP performance testing tool that comes with LuaJIT built-in, making it convenient for you to generate the required request payload or customize the response handling method according to your actual needs.

The wrk tool itself does not provide yum or apt installation methods, so it needs to be installed by compiling the source code. For example, you can run the following commands to compile and install wrk:

$ git clone https://github.com/wg/wrk
$ cd wrk
$ apt-get install build-essential -y
$ make
$ sudo cp wrk /usr/local/bin/

The command line parameters of wrk are relatively simple. For instance, we can use wrk to retest the performance of the previously launched Nginx server.

# -c represents 1000 concurrent connections, -t represents 2 threads
$ wrk -c 1000 -t 2 http://192.168.0.30/
Running 10s test @ http://192.168.0.30/
  2 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    65.83ms  174.06ms   1.99s    95.85%
    Req/Sec     4.87k   628.73     6.78k    69.00%
  96954 requests in 10.06s, 78.59MB read
  Socket errors: connect 0, read 0, write 0, timeout 179
Requests/sec: 9641.31
Transfer/sec: 7.82MB

In this example, we used 2 threads and 1000 concurrent connections to retest the performance of Nginx. As you can see, the throughput is 9641 requests per second and 7.82MB per second, with an average latency of 65ms, which is much better than the results obtained from ab testing.

This also indicates that the performance of the performance tool itself is crucial for accurate performance testing of the application.

Of course, the biggest advantage of wrk is its built-in LuaJIT, which can be used to implement performance testing for complex scenarios. In wrk, when calling Lua scripts, you can divide the HTTP requests into three stages: setup, running, and done, as shown in the following diagram:

(Image from NetEase Cloud Blog)

For example, you can have wrk retrieve an authentication token and set it on all subsequent requests (as in the official wrk example):

-- example script that demonstrates response handling and
-- retrieving an authentication token to set on all future
-- requests

token = nil
path  = "/authenticate"

request = function()
   return wrk.format("GET", path)
end

response = function(status, headers, body)
   if not token and status == 200 then
      token = headers["X-Token"]
      path  = "/resource"
      wrk.headers["X-Token"] = token
   end
end

When executing the test, you can use the -s option to specify the path of the Lua script:

$ wrk -c 1000 -t 2 -s auth.lua http://192.168.0.30/

wrk requires you to use Lua scripts to construct the request payload, which is sufficient for most scenarios. But its disadvantage is that everything needs to be constructed with code, and the tool itself does not provide a GUI environment.

On the other hand, tools such as JMeter or LoadRunner (a commercial product) provide richer features for complex scenarios, such as script recording, playback, and a GUI, and are more convenient to use.

Summary #

Today, I took you through a review of network performance metrics and introduced methods for evaluating network performance.

Performance evaluation is the premise for optimizing network performance. Only when you identify network performance bottlenecks do you need to optimize network performance. According to the principles of the TCP/IP protocol stack, different protocol layers focus on different performance aspects, which correspond to different performance testing methods. For example:

  • At the application layer, you can use tools such as wrk and Jmeter to simulate user loads and test metrics like requests per second, processing delay, and error count of the application program.

  • At the transport layer, you can use tools like iperf to test TCP throughput.

  • Going further down, you can also use pktgen, a tool built into the Linux kernel, to test server packets per second (PPS).

Since lower-layer protocols form the foundation of higher-layer protocols, generally, we need to perform performance testing on each protocol layer from top to bottom. Based on the results of performance testing and combining them with the principles of the Linux network protocol stack, we can identify the root cause of the performance bottleneck and optimize network performance accordingly.

Reflection #

Finally, I would like to have a chat with you.

  • How do you evaluate network performance?
  • When evaluating network performance, which protocol layer do you start from and which metrics do you choose as the core objectives of performance testing?
  • What tools do you use to test and analyze network performance?

You can summarize your thoughts based on the network knowledge you have learned today.

Feel free to discuss with me in the comment section, and also feel free to share this article with your colleagues and friends. Let’s practice in real scenarios and improve through communication.