
09 Practical Implementation of North-South Traffic Component IPVS #

We know that traffic on Kubernetes worker nodes is managed by kube-proxy. kube-proxy uses the packet-rewriting capability of iptables to build a cluster-wide virtual network at the data plane; this is the ClusterIP you see in a Service object, that is, the cluster network IP. Since iptables already supports traffic load balancing and can act as a reverse proxy for north-south traffic, why do we need to replace it with another system component, IPVS?

The main reason is that iptables has limited capacity for handling Service objects: once the number of Services exceeds roughly 1,000, performance bottlenecks begin to appear. IPVS mode is now the proxy mode recommended by the Kubernetes community, which pushes us to understand how IPVS works, become familiar with its scope of application, and weigh its advantages and disadvantages against iptables, so that we can stay focused on application development.
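
If you want to confirm which mode your own cluster is running, a quick check looks roughly like this (a hedged sketch; the ConfigMap location assumes a kubeadm-style installation):

# in a kubeadm-managed cluster the configured mode is recorded in the kube-proxy ConfigMap
kubectl -n kube-system get configmap kube-proxy -o yaml | grep "mode:"

# kube-proxy typically also reports the mode it is actually running in on its metrics port (10249)
curl -s http://127.0.0.1:10249/proxyMode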

The Introduction of IPVS, Prompted by a Large-Scale Service Performance Evaluation #

iptables has always been a system component that Kubernetes clusters rely on, and it is also a Linux kernel module. In everyday practice we rarely notice its performance issues. At KubeCon 2018, however, a developer from Huawei raised a question about very large scale deployments, on the order of 10,000 Services:

Can kube-proxy still forward north-south traffic efficiently when there are as many as 10,000 Services?

According to the test data, the answer is no: when the number of Pod instances reaches tens of thousands, iptables begins to drag down system performance. We also need to understand why iptables cannot keep up.

First, both IPVS mode and iptables mode are built on Netfilter. When generating load-balancing rules, IPVS forwards traffic by looking up a hash table, while iptables forwards traffic by traversing its rules one by one. Because iptables must match rules from top to bottom, CPU consumption inevitably grows and forwarding efficiency drops as the number of rules increases. By contrast, after IPVS generates load-balancing rules for Services, each lookup touches only a bounded portion of the hash table, so its forwarding performance comfortably exceeds that of iptables mode.
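
As a rough on-node illustration of the difference (the counts depend entirely on your cluster; these commands merely expose the two data structures):

# iptables mode: kube-proxy renders a chain of NAT rules per Service, matched from top to bottom
sudo iptables-save -t nat | grep -c KUBE-SVC

# IPVS mode: each Service IP:port becomes a virtual server entry in an in-kernel hash table
sudo ipvsadm -Ln | grep -cE "^(TCP|UDP)"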

Second, we need to be clear that iptables is, at heart, a tool for configuring the kernel's firewall, and its design goal differs from that of a dedicated load-balancing component like IPVS, so we cannot simply declare IPVS better than iptables. Even with IPVS mode enabled, only the forwarding of north-south traffic is replaced; NAT for east-west traffic still relies on iptables. For a more complete picture of how the two compare in performance, refer to the following figure:

(Figure: ipvs-iptables-compare — CPU consumption of iptables vs. IPVS as the number of Services grows)

The figure shows that once the number of Services exceeds 1,000, iptables and IPVS begin to affect CPU usage differently, and the gap widens as the scale grows. This is clearly a consequence of how differently the two look up their forwarding rules.

In addition to faster rule matching, IPVS also offers more flexible load-balancing algorithms than iptables, such as:

  • rr: round-robin scheduling
  • lc: least-connection scheduling
  • dh: destination hashing
  • sh: source hashing
  • sed: shortest expected delay
  • nq: never-queue

In iptables mode, by contrast, only the default round-robin scheduling is available. The example below sketches how an IPVS scheduler is chosen.
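
For example, a scheduler can be selected when a virtual service is created and switched later with ipvsadm (the address below is only a placeholder):

# create a virtual service that uses least-connection scheduling
sudo ipvsadm -A -t 100.100.100.200:80 -s lc

# switch the same virtual service to shortest-expected-delay scheduling
sudo ipvsadm -E -t 100.100.100.200:80 -s sed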

Now let us revisit the definition of IPVS (IP Virtual Server). It is a Linux kernel module built on top of the Netfilter framework to provide load balancing for north-south traffic. IPVS forwards requests at Layer 4, binding the unique virtual IP of a Service to the group of container-replica IPs behind it so that traffic is balanced across them. It therefore maps naturally onto the definition of a Service in Kubernetes.

A Brief Introduction to Using IPVS #

Install ipvsadm, the management tool for LVS:  

sudo apt-get install -y ipvsadm

Create a virtual service:  

sudo ipvsadm -A -t 100.100.100.100:80 -s rr

Start two backend instances as containers:

$ docker run -d -p 8000:8000 --name first -t jwilder/whoami
cd977829ae0c76236a1506c497d5ce1628f1f701f8ed074916b21fc286f3d0d1

$ docker run -d -p 8001:8000 --name second -t jwilder/whoami
5886b1ed7bd4095cb02b32d1642866095e6f4ce1750276bd9fc07e91e2fbc668

Find out the container addresses:  

$ docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' first
172.17.0.2

$ docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' second
172.17.0.3

$ curl 172.17.0.2:8000
I'm cd977829ae0c

Bind the container addresses to the virtual service as real servers (the -m flag selects NAT/masquerade mode):

$ sudo ipvsadm -a -t 100.100.100.100:80 -r 172.17.0.2:8000 -m
$ sudo ipvsadm -a -t 100.100.100.100:80 -r 172.17.0.3:8000 -m

$ sudo ipvsadm -l
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  100.100.100.100:http rr
  -> 172.17.0.2:8000              Masq    1      0          0
  -> 172.17.0.3:8000              Masq    1      0          0
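
Assuming the two real servers registered above, requests to the virtual IP from the host should now alternate between the containers under the rr scheduler, roughly like this:

$ curl 100.100.100.100
I'm cd977829ae0c
$ curl 100.100.100.100
I'm 5886b1ed7bd4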

That completes the basic IPVS load-balancing configuration; kube-proxy drives IPVS in essentially the same way.

Explanation of kube-proxy’s IPVS mode configuration parameters #

  • --proxy-mode parameter: When set to --proxy-mode=ipvs, kube-proxy uses IPVS NAT forwarding to load-balance traffic to Pod ports.
  • --ipvs-scheduler parameter: Changes the load-balancing algorithm. The default is rr, round-robin scheduling.
  • --cleanup-ipvs parameter: Clears leftover IPVS rules from previous runs before starting.
  • --ipvs-sync-period and --ipvs-min-sync-period parameters: Configure how often IPVS rules are synchronized, e.g. 5s; the value must be greater than 0.
  • --ipvs-exclude-cidrs parameter: A list of CIDR ranges that kube-proxy will not touch when cleaning up IPVS rules, so load-balancing rules you created yourself are preserved.
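
Put together, and purely as an illustration (the values below are arbitrary examples, not recommendations), a kube-proxy command line using these parameters might look like:

kube-proxy \
  --proxy-mode=ipvs \
  --ipvs-scheduler=lc \
  --ipvs-sync-period=30s \
  --ipvs-min-sync-period=5s \
  --ipvs-exclude-cidrs=10.100.0.0/16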

Exceptions when using IPVS mode #

Although IPVS handles traffic load balancing, there are still scenarios where it cannot help. For example, packet filtering, port mirroring, SNAT, and similar requirements still rely on iptables. In addition, there are four cases in which IPVS mode falls back on iptables:

  • kube-proxy is started with --masquerade-all=true.
  • kube-proxy is started with a cluster CIDR specified.
  • A LoadBalancer-type Service is used.
  • A NodePort-type Service is used.

To keep the number of iptables rules from growing out of control, kube-proxy also uses the ipset tool to collapse many individual rules into set matches. The following table lists the ipset sets maintained in IPVS mode:

| set name | members | usage |
| --- | --- | --- |
| KUBE-CLUSTER-IP | All Service IP + port | masquerade for cases that masquerade-all=true or clusterCIDR specified |
| KUBE-LOOP-BACK | All Service IP + port + IP | masquerade for resolving hairpin issue |
| KUBE-EXTERNAL-IP | Service External IP + port | masquerade for packets to external IPs |
| KUBE-LOAD-BALANCER | Load Balancer ingress IP + port | masquerade for packets to Load Balancer type service |
| KUBE-LOAD-BALANCER-LOCAL | Load Balancer ingress IP + port with externalTrafficPolicy=local | accept packets to Load Balancer with externalTrafficPolicy=local |
| KUBE-LOAD-BALANCER-FW | Load Balancer ingress IP + port with loadBalancerSourceRanges | drop packets for Load Balancer type Service with loadBalancerSourceRanges specified |
| KUBE-LOAD-BALANCER-SOURCE-CIDR | Load Balancer ingress IP + port + source CIDR | accept packets for Load Balancer type Service with loadBalancerSourceRanges specified |
| KUBE-NODE-PORT-TCP | NodePort type Service TCP port | masquerade for packets to NodePort (TCP) |
| KUBE-NODE-PORT-LOCAL-TCP | NodePort type Service TCP port with externalTrafficPolicy=local | accept packets to NodePort Service with externalTrafficPolicy=local |
| KUBE-NODE-PORT-UDP | NodePort type Service UDP port | masquerade for packets to NodePort (UDP) |
| KUBE-NODE-PORT-LOCAL-UDP | NodePort type Service UDP port with externalTrafficPolicy=local | accept packets to NodePort Service with externalTrafficPolicy=local |
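
On a node running in IPVS mode, you can inspect these sets directly with the ipset tool (set names and contents will vary with your cluster):

# list the kube-proxy managed sets
sudo ipset list -n | grep KUBE

# show the members of one set, e.g. all ClusterIP:port pairs
sudo ipset list KUBE-CLUSTER-IP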

Considerations when using IPVS #

Using IPVS mode in Kubernetes covers most traffic load-balancing scenarios. However, it cannot redistribute long-lived TCP connections when you scale horizontally, because IPVS balances traffic per connection and has no equivalent of a keepalive_requests limit. If you run into this, a temporary workaround is to turn long-lived connections into short-lived ones: by setting a request limit (e.g. 1000), the server marks Connection: close in the HTTP response header once the limit is reached, telling the client to close the connection after the current request completes. New requests then establish fresh TCP connections, so no requests fail during the switch, and long-lived connections are converted into short-lived ones as needed. As a longer-term solution, you should deploy a group of Nginx or HAProxy instances in front of the cluster to enforce this limit on long-lived connections and allow traffic to scale elastically.
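
A minimal sketch of that workaround, assuming an Nginx instance already sits in front of the backends (the file path, port, and limit below are illustrative assumptions):

# cap each keep-alive connection at 1000 requests; Nginx then answers the request that
# hits the limit with "Connection: close", forcing clients to reconnect and be rebalanced
cat <<'EOF' | sudo tee /etc/nginx/conf.d/keepalive-cap.conf
server {
    listen 80;
    keepalive_requests 1000;
    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_http_version 1.1;
    }
}
EOF
sudo nginx -s reload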

Summary of Practice #

The introduction of IPVS mode is an optimization the community adopted after evaluating performance in very large clusters. Replacing the iptables-based load-balancing implementation with the IPVS module that already exists in the kernel can be considered a very successful best practice. Because IPVS lives in the kernel, the kernel version has a significant impact on it and must be taken into account when you adopt it. While I was still weighing the trade-offs between IPVS and iptables, the Linux community was already pushing forward a new technology, eBPF (extended Berkeley Packet Filter), as a replacement for both iptables and IPVS. Even if you have not heard of eBPF itself, you have probably seen the Cilium container networking solution, which is built on it:

(Figure: 8-2-cilium-diagram — the Cilium container networking solution built on eBPF)

With eBPF, traffic forwarding and container network interconnection have already been implemented on newer kernel versions. We look forward to the day when eBPF can fully replace iptables and IPVS, giving us an even more powerful and higher-performance traffic management component.