01 Rethinking the Core Components of Kubernetes #

In this article, we begin looking at the core components of Kubernetes. To help you build a complete picture of the Kubernetes architecture up front, I have summarized each core component below (a quick way to inspect them on a live cluster is sketched right after the list):

  • kube-apiserver: Provides HTTP REST interfaces for CRUD operations and watching various resource objects (such as Pod, RC, Service) in Kubernetes. It serves as the management entry point for the entire system.
  • kube-controller-manager: As the management control center within the cluster, it is responsible for managing objects such as nodes, pod replicas, service endpoints, namespaces, service accounts, and resource quotas.
  • kube-scheduler: The cluster scheduler, providing rich policy and elastic topology capabilities. The scheduling implementation focuses on availability, performance, and capacity.
  • kube-proxy: Implements service discovery and load balancing for Service traffic by acting as a reverse proxy on every node.
  • kubelet: The agent on each worker node that manages the lifecycle of Pods and their containers as instructed by the control plane.
  • etcd: A distributed key-value store that centrally stores data and state for the Kubernetes cluster.
  • cni-plugins: The standard network plugins maintained by the Container Network Interface (CNI) community, such as flannel, ptp, host-local, portmap, tuning, vlan, sample, dhcp, ipvlan, macvlan, loopback, and bridge. Each plugin provides a single flat layer of virtual networking and cannot cascade multiple network layers together.
  • runc: Container runtime process for running individual containers, compliant with the Open Container Initiative (OCI) standards.
  • cri-o: A container runtime management process that implements the Kubernetes Container Runtime Interface (CRI), playing a role similar to containerd, the runtime management component extracted from Docker.
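
If you already have access to a running cluster, a quick way to see most of these components in action is to query them with kubectl. The following is a minimal sketch assuming a kubeadm-style cluster, where the control plane components run as static Pods in the kube-system namespace; names and namespaces may differ in other distributions.

```bash
# List the control plane and node components that run as Pods (kubeadm-style clusters).
kubectl get pods -n kube-system -o wide

# Ask kube-apiserver for its aggregated readiness checks (etcd, controllers, and so on).
kubectl get --raw='/readyz?verbose'

# Show the kubelet version and the container runtime reported by each node.
kubectl get nodes -o wide
```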

To have a deeper understanding of the layout of the core components of Kubernetes, we can refer to the following architecture design diagram:

1-1-k8s-arch

With the introduction above, we have covered the basics of the core components. Yet, based on feedback from users who have adopted Kubernetes-native technologies in recent years, many still find the system too complex and hard to manage, and they worry that failures in it could severely impact their business systems.

So why do people worry about the stability of their business systems when the Kubernetes components were designed for distributed systems in the first place? From my conversations with users, there is no unified solution that everyone can simply copy. During implementation, each organization can only learn by trial and error and keep improving its plan through iteration. Because business scale differs, Kubernetes implementation architectures vary widely, which makes it hard to keep the infrastructure consistent across enterprise IT environments. Best practices spread from one business to another by word of experience, and what works for user A may not work for user B.

Of course, setting aside those objective constraints, the point of adopting Kubernetes is to keep an enterprise's IT infrastructure consistent and to scale elastically as business needs grow. After all, Kubernetes is an open-source container orchestration system that Google released based on its successful experience with its internal Borg application management system, and its development has absorbed the distilled experience of the whole industry. So, in the current wave of digital transformation, enterprises are rushing, sometimes blindly, to switch to this new environment for fear that technological lag will hold back business development.

The purpose of this article is to deepen your understanding of Kubernetes components from the perspective of an enterprise and enable you to make good use of them. Therefore, I am planning to analyze the following aspects:

  • Usage strategies for control plane components
  • Usage strategies for worker node components
  • Usage strategies for additional worker node components

Usage Strategies for Control Plane Components #

When it comes to maintaining a Kubernetes cluster as a new user, the first step is usually to follow the installation documentation and set up the cluster. The tools for installing clusters in the market can be divided into two categories:

The first category is tools for learning and local experimentation (a quick-start sketch using Kind follows the two lists below):

  • Docker Desktop for Kubernetes
  • Kind (Kubernetes IN Docker)
  • Minikube
  • MicroK8s

The second category is production-grade installation tools:

  • kubeadm
  • kops
  • kubespray
  • Rancher Kubernetes Engine (RKE)
  • K3s
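
For the learning-oriented tools above, the barrier to entry is deliberately low. As a minimal sketch, assuming Docker is already installed locally, a disposable Kind cluster can be created and torn down in a few commands (the cluster name demo is just a placeholder):

```bash
# Create a single-node Kubernetes cluster running inside a Docker container.
kind create cluster --name demo

# Point kubectl at the new cluster and confirm the control plane is reachable.
kubectl cluster-info --context kind-demo

# Delete the cluster once you are done experimenting.
kind delete cluster --name demo
```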

Among them, kubeadm and RKE install the cluster components as containers. Although running the cluster components in system containers reduces intrusion into the host, it also steadily increases the operational burden of maintaining them. Anyone who has maintained containerized applications knows that container technology is designed to isolate application processes, not to serve as an isolation tool for system processes. Containers are short-lived and can fail at any time, and when a container process fails, it is hard to preserve the failure environment for analysis. The common workaround is to paper over the fault by restarting the container, in the hope of restoring service quickly.

However, it is exactly these small latent problems that are so frustrating, because they may not reappear for a long time. For system processes, Linux already has a mature maintenance tool: systemd. Its ecosystem is well established and provides a consistent experience across Linux distributions, and when problems occur, operators can log in to the host and inspect the system logs directly to locate and fix the error. Based on this accumulated experience, I recommend running the Kubernetes components as native host processes in production, so that operations staff can focus on building redundancy into the cluster architecture. The difference is sketched below.
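
Below is a rough sketch, assuming kubelet (and, in a binary deployment, the other components) is managed as a systemd unit; troubleshooting then stays entirely on the host.

```bash
# Check whether the kubelet service is running and when it last restarted.
systemctl status kubelet

# Follow the kubelet logs through journald, without entering any container.
journalctl -u kubelet -f

# For components that run as containers instead, the same investigation has to
# go through the container runtime, for example:
crictl ps | grep kube-apiserver
```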

Next, let's revisit the architecture of the etcd cluster. According to the official Kubernetes documentation, production-grade Kubernetes clusters generally follow one of two topologies, depending on where the etcd members run.

Stacked etcd cluster topology:

kubeadm-ha-topology-stacked-etcd

Independent etcd cluster topology:

kubeadm-ha-topology-external-etcd
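
To make the difference between the two topologies concrete, here is a minimal sketch of a kubeadm ClusterConfiguration for the independent (external) etcd topology. The etcd endpoints and certificate paths are placeholders for your own environment.

```bash
# Sketch: point kubeadm at an externally managed etcd cluster (placeholder endpoints).
cat <<'EOF' > kubeadm-external-etcd.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
etcd:
  external:
    endpoints:
      - https://10.0.0.11:2379
      - https://10.0.0.12:2379
      - https://10.0.0.13:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
EOF

# Initialize the first control plane node against that configuration.
kubeadm init --config kubeadm-external-etcd.yaml
```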

As the diagrams above show, how the etcd cluster is deployed affects how large the Kubernetes cluster can grow. In practice, the machines purchased are often high-performance, large-memory blade servers; the business side expects those resources to be fully used for workloads and is reluctant to dedicate them to the cluster's control and management components.

In this situation, many deployments choose the first option: colocating the control plane, the worker role, and the etcd members on the same machines to reuse resources. From a high-availability standpoint, however, packing applications more densely does not bring unlimited benefit. Consider what happens with this architecture when one of those nodes goes down: the impact on business applications is severe. The general rule of high-availability design is therefore that worker nodes must be deployed separately from control plane nodes. In a mixed virtualized environment, it is most appropriate to place the control plane on smaller virtual machines. When the infrastructure consists entirely of physical machines, dedicating whole physical servers to the control plane is wasteful; a better approach is to use virtualization software to carve a batch of small and medium-sized virtual machines out of the existing physical cluster and allocate them to the control plane on demand.

In addition to the standard deployment options above, the community also provides K3s, which packages the core components into a single binary and can run as a single-node cluster. The advantage is that there is only one system process to manage, which makes the cluster very easy to operate and to restore. In a purely physical-machine environment, applications deployed on this single-node architecture can still achieve high availability and disaster recovery by running multiple redundant clusters side by side. The figure below shows the K3s architecture:

k3s-arch

K3s was originally designed as a lightweight Kubernetes distribution for edge and embedded environments, but that does not prevent us from using it flexibly in production. It exposes the same stable, native Kubernetes APIs and can orchestrate containerized workloads on x86 clusters just as well.
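
As a minimal sketch of how lightweight this single-binary approach is, and assuming the node has outbound internet access, a K3s server can be brought up with the official install script and queried immediately:

```bash
# Install and start a single-node K3s server (it runs as a systemd service).
curl -sfL https://get.k3s.io | sh -

# K3s bundles kubectl; verify the node and the built-in components are up.
k3s kubectl get nodes
k3s kubectl get pods -A
```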

Usage Strategies for Worker Node Components #

The components installed by default on a worker node are kubelet and kube-proxy. In real deployments, kubelet has many configuration parameters that need tuning; they are adjusted to the requirements of the business, so no two configurations are exactly alike. Let's look at kubelet again: it is the control process responsible for starting Pods on the node. Although kubelet drives the overall task of starting containers, the concrete operations are delegated to the container engine on the host. For that engine we can choose components such as containerd or CRI-O; some distributions (OpenShift, for example) default to CRI-O, but the engine most commonly deployed in practice is containerd. Because its deployment volume is enormous, many latent problems are found and fixed promptly. containerd is the container management component that was extracted from the Docker engine, and users' long experience with Docker gives them a great deal of confidence in operating and managing containers with it.
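
To give a feel for what tuning kubelet looks like, here is a minimal sketch of a KubeletConfiguration fragment with a few commonly adjusted fields. The concrete values are purely illustrative and must be derived from your own workload profile, not copied as recommendations.

```bash
# Sketch: a KubeletConfiguration fragment with illustrative (not recommended) values.
cat <<'EOF' > kubelet-config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 110                # per-node Pod density
systemReserved:             # resources set aside for OS daemons
  cpu: "500m"
  memory: "512Mi"
evictionHard:               # thresholds that trigger Pod eviction
  memory.available: "200Mi"
  nodefs.available: "10%"
EOF

# On containerd-based nodes, kubelet is also pointed at the CRI socket, commonly via:
#   --container-runtime-endpoint=unix:///run/containerd/containerd.sock
```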

containerd-plugin

For the maintenance of container instances, the command-line tool people traditionally reach for is docker. After switching to containerd, the command-line tools become ctr and crictl. Users are often unclear about what these two tools are for and confuse them with Docker.

Docker can be understood as the most complete tool for developing and managing containers on a single machine, and it needs no further introduction. ctr is the client command-line tool for containerd itself, and its main capability is managing the containers and images that containerd knows about. crictl is a tool for managing any runtime that implements the Container Runtime Interface (CRI); it is the client of the cri-containerd component shown in the diagram, and its functionality centers on images and on running workloads at the Pod level.

Pay attention to the differences in implementation details among docker, ctr, and crictl. For example, both docker and ctr manage images and containers at the host level, but they keep independent storage and metadata, so even if you load the same image with each tool, it lands in a different location on the host and the image layers cannot be shared between them. crictl, on the other hand, operates at the Pod level and does not manipulate images by itself; it sends CRI requests to the underlying image service, such as the containerd process.
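
The separation described above is easy to observe on a containerd-based node. A rough sketch (both tools require root and a running containerd; the listings will differ per node):

```bash
# ctr talks to containerd directly; Kubernetes-pulled images live in the "k8s.io" namespace.
ctr --namespace k8s.io images ls

# crictl talks to the CRI endpoint and thinks in Pod-level terms.
crictl pods      # Pod sandboxes known to the runtime
crictl ps        # containers grouped under those sandboxes
crictl images    # images as seen through the CRI image service
```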

Another component is kube-proxy, a reverse-proxy service built around the Service abstraction. By combining a Service with its Endpoint objects, traffic can be load balanced across backends according to the configured strategy. To provide a cluster-wide service discovery mechanism, each Service is assigned a globally unique name, the Service name, which can be resolved inside the cluster by the add-on DNS component, CoreDNS. For the traffic load balancing itself, Kubernetes uses iptables or IPVS (IP Virtual Server).
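
A minimal sketch of this naming and load-balancing chain, assuming a Deployment named web already exists in the default namespace (the name and ports are placeholders):

```bash
# Expose the Deployment behind a Service; kube-proxy programs the forwarding rules.
kubectl expose deployment web --port=80 --target-port=8080

# The Service receives a stable ClusterIP and the DNS name web.default.svc.cluster.local.
kubectl get service web

# The Endpoints object lists the Pod IPs that traffic is balanced across.
kubectl get endpoints web
```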

At ordinary cluster scale, the number of Services rarely exceeds 500. However, Huawei's container technology team ran an extreme stress test and uncovered performance bottlenecks in the iptables implementation of the reverse proxy: the experiment confirmed that once the number of Services grows large enough, matching Service rules in iptables behaves as O(n), while switching to IPVS brings lookup close to O(1). The stress test results are shown below:

ipvs-iptables-perf

Note that the default mode of the kube-proxy reverse-proxy module is still iptables; IPVS has to be enabled explicitly. Both iptables and IPVS are built on the Linux kernel's netfilter framework, and what they have in common is that both can act as a reverse proxy. There are, however, three differences worth knowing:

  • IPVS provides scalability and high performance for large-scale clusters.
  • IPVS provides more diverse load balancing algorithms (such as least load, least connections, locality-based scheduling, weighted round-robin, etc.).
  • IPVS supports server health checking and connection retry mechanisms.

When kube-proxy in IPVS mode needs packet filtering, SNAT, or masquerading, it still relies on iptables, but it uses the ipset extension so that only a fixed number of rules has to be maintained. Unlike the iptables mode, the rule set does not grow linearly as Services and Pods are added, so it does not drive up CPU load on the node or eat into the cluster's processing capacity.
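
A rough sketch of switching kube-proxy to IPVS mode on a kubeadm-managed cluster and inspecting the result. It assumes the ipvsadm and ipset utilities are installed on the node, and the ConfigMap change only takes effect after the kube-proxy Pods restart.

```bash
# Set mode: "ipvs" in the kube-proxy configuration, then restart its Pods.
kubectl -n kube-system edit configmap kube-proxy
kubectl -n kube-system rollout restart daemonset kube-proxy

# Confirm the proxy mode kube-proxy reports about itself (run on a node).
curl -s http://localhost:10249/proxyMode

# Inspect the IPVS virtual servers and one of the fixed ipset sets.
ipvsadm -Ln
ipset list KUBE-CLUSTER-IP
```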

The following table lists the ipset sets that the IPVS mode requires:

| set name | members | usage |
| --- | --- | --- |
| KUBE-CLUSTER-IP | all Service IP + port | mark-masq for cases where masquerade-all=true or clusterCIDR is specified |
| KUBE-LOOP-BACK | all Service IP + port + IP | masquerade for hairpin traffic |
| KUBE-EXTERNAL-IP | Service external IP + port | masquerade for packets to external IPs |
| KUBE-LOAD-BALANCER | load balancer ingress IP + port | masquerade for packets to LoadBalancer-type Services |
| KUBE-LOAD-BALANCER-LOCAL | LB ingress IP + port with externalTrafficPolicy=local | accept packets to load balancers with externalTrafficPolicy=local |
| KUBE-LOAD-BALANCER-FW | load balancer ingress IP + port with loadBalancerSourceRanges | packet filter for load balancers with loadBalancerSourceRanges specified |
| KUBE-LOAD-BALANCER-SOURCE-CIDR | load balancer ingress IP + port + source CIDR | packet filter for load balancers with loadBalancerSourceRanges specified |
| KUBE-NODE-PORT-TCP | NodePort-type Service TCP port | masquerade for packets to nodePort (TCP) |
| KUBE-NODE-PORT-LOCAL-TCP | NodePort-type Service TCP port with externalTrafficPolicy=local | accept packets to NodePort Services with externalTrafficPolicy=local |
| KUBE-NODE-PORT-UDP | NodePort-type Service UDP port | masquerade for packets to nodePort (UDP) |
| KUBE-NODE-PORT-LOCAL-UDP | NodePort-type Service UDP port with externalTrafficPolicy=local | accept packets to NodePort Services with externalTrafficPolicy=local |

Additionally, even in IPVS mode kube-proxy still falls back on iptables in the following scenarios (a short sketch of inspecting the resulting rules follows the list):

  • kube-proxy is started with the --masquerade-all=true flag
  • kube-proxy is started with a Pod network CIDR (clusterCIDR) specified
  • Services of LoadBalancer type
  • Services of NodePort type
  • Services with externalIPs configured
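
Assuming kube-proxy on the node is already running in IPVS mode, the small, fixed-size iptables rule set it keeps for these cases can be inspected directly; the rules match against ipset sets instead of enumerating every Service.

```bash
# Even in IPVS mode, kube-proxy maintains a handful of iptables rules that
# reference ipset sets rather than one rule per Service.
iptables -t nat -L KUBE-SERVICES -n
iptables -t nat -L POSTROUTING -n
```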

Usage Strategies for Worker Node Add-ons #

When it comes to add-ons, common sense says these components are optional and merely provide extra capabilities. In a Kubernetes cluster, however, several of these add-ons are essential and must be installed; without them the cluster is of little practical value. Kubernetes officially groups the add-ons into the following five types:

  • Networking and Network Policies
  • Service Discovery
  • Visualization and Management
  • Infrastructure
  • Legacy Components

The titles alone suggest what these components are for. Here I will reintroduce them from a practical perspective and offer insights that should prove useful in real deployments.

1. Networking and Network Policies

When we talk about networking here, we mainly mean container networking. Keep in mind that a Kubernetes cluster contains two layers of virtual networks, and as soon as virtual networks are involved, packet loss becomes a concern in a way that traditional virtualized environments rarely had to face. To mitigate or even avoid these problems entirely, you would have to give up the official overlay-style solutions and fall back on traditional network approaches; most of those, however, were not designed for Kubernetes and need a lot of custom adaptation to work well.

Setting those makeshift options aside, the most popular container networking solutions include Calico, Cilium, Flannel, Contiv, and others. Even with these solutions, packet loss tends to appear as business traffic grows, and its symptom is reduced processing capacity in the affected business instances. The standard remedy is to scale the number of container instances horizontally, and because adding instances does restore overall throughput, operators easily overlook the performance cost imposed by the container network itself.

In addition, Kubernetes has taken mainstream network management requirements into account and introduced Network Policies. These policies define the connectivity relationships between Pods and make it possible to isolate groups of containerized workloads from each other. In my practical experience, however, the policies depend entirely on the implementation capabilities of the underlying container network plugin, which makes them feel more experimental than dependable; so far I have not seen them deliver a clear advantage in real business scenarios.
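
For completeness, here is a minimal sketch of what a Network Policy looks like. Whether it is actually enforced depends entirely on the CNI plugin in use, as noted above; the namespace and labels are placeholders.

```bash
# Sketch: allow ingress to "app=web" Pods only from "app=frontend" Pods in the same
# namespace; enforcement requires a CNI plugin that implements NetworkPolicy.
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-allow-frontend
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
EOF
```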

2. Service Discovery

What this category provides today is DNS service for Pods, together with the naming rules for cluster domain names. The only officially recognized option is CoreDNS. Note that this service discovery works only inside the cluster and should not be exposed directly to the outside. Instead, the cluster exposes services externally through an IP and port, and an external DNS can point at that fixed address to provide global service discovery.
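
A quick sketch of verifying in-cluster name resolution from a throwaway Pod; busybox is used here only as a convenient image that ships nslookup.

```bash
# Resolve the built-in "kubernetes" Service name through CoreDNS from inside the cluster.
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local
```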

3. Visualization and Management

Kubernetes provides the Dashboard, the official standard web interface for managing and administering a cluster. Many development and integration-testing environments use it for day-to-day management needs. This component is optional.

4. Infrastructure

This category includes KubeVirt, an add-on that allows Kubernetes to run virtual machines, typically on bare-metal clusters. In current practice this capability is still considered experimental and is not widely used.

5. Legacy Components

A variety of legacy components remain available for older versions of Kubernetes. Keeping these optional components around helps users preserve the capabilities of existing clusters while migrating to newer versions, though few people actually use them.

Through the introduction from these three perspectives, I believe that everyone now has a deeper understanding of the core components of Kubernetes. In practical production scenarios, in order to standardize the operations and maintenance model, we can define a baseline model for Kubernetes components based on business needs and selectively use these components. I am confident that this approach can help avoid many compatibility issues. In most of the Kubernetes cluster failure cases I have encountered, the problems were mostly caused by incorrect usage or improper application of these components, which made the problems more complicated and difficult to reproduce.

Of course, managed Kubernetes in the cloud can resolve these baseline issues completely, and I believe users will find it increasingly easy to obtain reliable Kubernetes cluster environments. Just remember: we are simply handing the difficulty of operating and maintaining Kubernetes over to professional cloud providers.