14 The Use of Cache Scenario II - How to Achieve High Availability of Cache #

Hello, I’m Tang Yang.

In the previous lessons, I introduced you to the principles of caching, its classifications, and the techniques for using popular caches. We started using caching to handle most of the read pressure, relieving the query pressure on the database and improving performance while ensuring system stability. At this point, the overall architecture of your e-commerce system looks like the following diagram:

img

We have added a caching layer between the web layer and the database layer. Requests will first query the cache, and only when the cache does not have the required data will they query the database.

Here, you need to pay attention to the cache hit rate (cache hit rate = number of requests hitting the cache / total number of requests). Generally, in your e-commerce system, the hit rate of the core cache needs to be maintained at 99% or even 99.9%. Even a 1% decrease can have a devastating impact on the system.

This is not an exaggeration. Let’s do the math. Suppose your system’s QPS is 10,000/s and each request accesses the cache or database 10 times. When the cache hit rate drops by just 1%, the database receives an additional 10,000 * 10 * 1% = 1,000 requests per second. A single MySQL node typically peaks at around 1,500 read requests per second, so those extra 1,000 requests can easily put severe strain on the database.
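To make the arithmetic concrete, here is the same back-of-the-envelope calculation as a few lines of Python (the numbers are the ones from the paragraph above):

```python
qps = 10_000            # system requests per second
calls_per_request = 10  # cache/database accesses per request
hit_rate_drop = 0.01    # hit rate falls by one percentage point

extra_db_qps = qps * calls_per_request * hit_rate_drop
print(extra_db_qps)     # 1000.0 extra queries/s hitting MySQL
```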

If a mere 1% drop in hit rate has such a frightening impact, imagine the consequences of an entire cache node failing. The single-node cache deployment shown in the diagram is therefore the biggest hidden danger in the whole system. So how do we solve this problem and improve the availability of the cache?

We can deploy multiple nodes and design some solutions to make these nodes backup each other. This way, when one node fails, its backup node can take over and continue to provide services. These solutions are the focus of this lesson: high availability solutions for distributed caching.

In my projects, the solutions I choose fall into three categories: client-side solutions, intermediary proxy solutions, and server-side solutions.

  • The client-side solution involves configuring multiple cache nodes on the client side and using caching write and read algorithm strategies to achieve distributed caching and improve cache availability.
  • The intermediary proxy solution involves adding a proxy layer between the application code and the cache nodes. All client write and read requests go through the proxy layer, which incorporates high availability strategies to help improve cache system availability.
  • The server-side solution refers to the Redis Sentinel solution introduced in Redis 2.4.

Mastering these solutions can help you counter the impact of cache node failures causing a decrease in cache hit rate, and enhance the robustness of your system.

Client-side Solution #

In client-side solutions, you need to pay attention to both writing and reading data from the cache:

  • When writing data, the data being written to the cache needs to be distributed among multiple nodes, which is known as data sharding.
  • When reading data, you can use multiple sets of caches for fault tolerance, improving the availability of the cache system. For reading data, you can use either master-slave or multi-replica strategies, which are proposed to solve different problems.

Now let’s take a detailed look at how to implement these.

1. How to Shard Cache Data

A single cache node is limited by machine memory, network bandwidth, and the volume of requests it can handle concurrently. Therefore, we consider sharding data and distribute it to multiple nodes using a shard algorithm, with each node storing a portion of the data.

By doing this, if one node fails, the others can still provide services, ensuring a certain level of availability. It’s like not putting all your eggs in one basket. This way, if one basket falls and breaks, the eggs in the other baskets are still intact.

Generally, the common shard algorithms are the Hash Shard Algorithm and the Consistent Hash Shard Algorithm.

The Hash Shard algorithm involves calculating a hash value for the cache key and taking the modulo of the total number of cache nodes. Here’s an example to help you understand it:

Let’s say we have deployed a cache cluster with three cache nodes. When new data needs to be written, we first use a hash algorithm like CRC32 to generate a hash value for the cache key. Then we calculate the hash value modulo 3 to determine which cache node the data should be stored in.
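A minimal sketch of this routing logic in Python, using the standard library’s CRC32 (the node names are hypothetical):

```python
import zlib

NODES = ["cache-0", "cache-1", "cache-2"]  # three hypothetical cache nodes

def pick_node(key: str) -> str:
    """Route a key to a node: CRC32 hash of the key, modulo the node count."""
    return NODES[zlib.crc32(key.encode("utf-8")) % len(NODES)]

print(pick_node("user:42"))  # the same key always maps to the same node
```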

img

The biggest advantage of this algorithm is that it is simple and easy to understand. The drawback is that adding or removing a cache node changes the total node count, which remaps most keys to different nodes, effectively invalidating the cache and making it briefly unusable. Therefore, I recommend this method only if you are not sensitive to a drop in cache hit rate, for example, when there is another layer of caching to fall back on.

Of course, using the Consistent Hash algorithm can address the drop in hit rate when nodes are added or removed. In this algorithm, we organize the entire hash value space into a virtual ring and place the cache nodes on this ring by hashing their IP addresses or hostnames. To determine which node a key should be stored on or retrieved from, we hash the key with the same algorithm to find its position on the ring, then walk clockwise and use the first cache node we encounter.

For example, in the image below, Key 1 and Key 2 will be stored on Node 1, Key 3 and Key 4 on Node 2, Key 5 on Node 3, and Key 6 on Node 4.

img

If we add a Node 5 between Node 1 and Node 2, you can see that Key 3, which used to hit Node 2, now hits Node 5, while the other keys remain unchanged. Similarly, if we remove Node 3 from the cluster, only Key 5 will be affected. As you can see, when adding or removing nodes, only a small number of keys will “migrate” to other nodes, while the majority of keys will still hit the same nodes, ensuring that the hit rate does not drop significantly.

img

However, everything has two sides. Although this algorithm has little impact on hit rates, it still has some issues:

  • Cache nodes may be distributed unevenly on the ring, leaving some nodes under much more pressure than others. And when a node fails, all the requests it was handling shift to the next node on the ring, increasing that node’s load.
  • In extreme cases, the Consistent Hash algorithm can trigger cascading failures. Suppose three nodes A, B, and C each handle an equal share of the traffic. If node A fails, node B takes on roughly double its original load (all of A’s requests plus its own). If B then crashes under that traffic, node C has to absorb three times its original load and will crash as well, resulting in a cache system avalanche.

You may find this frightening, but don’t worry: as programmers we are creative about solving problems, and for this one you can introduce virtual nodes into the consistent hash algorithm.

With virtual nodes, each physical cache node is hashed multiple times and mapped to several positions spread around the ring. This balances the data distribution, and when a node fails or is removed, the keys it handled are distributed evenly across the remaining nodes instead of piling onto a single neighbor, which avoids the avalanche.
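Below is a minimal sketch of a consistent hash ring with virtual nodes in Python. The MD5 hash, the 100 virtual nodes per physical node, and the node names are all illustrative assumptions:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """A minimal consistent hash ring with virtual nodes (sketch)."""

    def __init__(self, nodes=None, replicas=100):
        self.replicas = replicas      # virtual nodes per physical node
        self.ring = {}                # hash position -> physical node
        self.sorted_keys = []         # sorted hash positions on the ring
        for node in nodes or []:
            self.add_node(node)

    def _hash(self, key):
        # MD5 spreads keys uniformly around the ring
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Each physical node is hashed `replicas` times onto the ring
        for i in range(self.replicas):
            position = self._hash(f"{node}#{i}")
            self.ring[position] = node
            bisect.insort(self.sorted_keys, position)

    def remove_node(self, node):
        for i in range(self.replicas):
            position = self._hash(f"{node}#{i}")
            del self.ring[position]
            self.sorted_keys.remove(position)

    def get_node(self, key):
        if not self.ring:
            return None
        # Walk clockwise: first virtual node at or after the key's position
        position = self._hash(key)
        index = bisect.bisect(self.sorted_keys, position) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[index]]

ring = ConsistentHashRing(["cache-1", "cache-2", "cache-3"])
print(ring.get_node("user:42"))   # maps the key to one of the nodes
ring.add_node("cache-4")          # only a fraction of keys migrate
```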

Next is the issue of dirty data in the consistent hash algorithm. Why does dirty data occur? Imagine a cluster with two nodes, A and B. The client first writes a key k with value 3 to Cache A. Later the client needs to update k to 4, but a temporary connection problem makes Cache A appear unavailable, so the update request drifts to Cache B. When the connection to Cache A is restored and the client reads k, it hits Cache A again and retrieves the stale value 3 instead of the 4 stored in Cache B.

Therefore, when using the consistent hash algorithm, it is essential to define a cache expiration time, so that when drift occurs, the previously stored dirty data may have expired, which reduces the chance of dirty data being present.

img

Obviously, the greatest advantage of data sharding is that it relieves the storage and access pressure on individual cache nodes, but it also complicates cache usage. In MultiGet (batch retrieval) scenarios, each batch request fans out to every node, so the per-node access load does not decrease. Moreover, with too many nodes the cache access SLA (Service Level Agreement, a measure of service availability) becomes hard to guarantee: by the “barrel principle” (a system is only as strong as its weakest part), the SLA is determined by the slowest, worst-performing node, and more nodes also mean a higher probability that at least one of them has problems. Therefore, I recommend 4 to 6 nodes as the optimal range.

2. Master-Slave Mechanism in Memcached

Redis itself supports master-slave deployment, but Memcached does not. So today, let’s mainly understand how Memcached implements the master-slave mechanism on the client side.

In a previous project, I ran into cache penetration caused by the failure of a single master node. To solve it, I configured a group of slaves for each group of masters and kept their data in sync on writes. Reads go to the slave first; on a miss, the request penetrates to the master, and the result is written back to the slave to keep the slave’s data hot.

The biggest advantage of the master-slave mechanism is that when a slave fails, there is still a master as a backup, so there won’t be a situation where a large number of requests penetrate to the database, which improves the high availability of the cache system.
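Here is a minimal sketch of that read-and-backfill path, assuming the pymemcache client and hypothetical host names; a production implementation would add timeouts, error handling, and serialization:

```python
from pymemcache.client.base import Client

# Hypothetical addresses for one master group and its slave group
master = Client(("memcached-master", 11211))
slave = Client(("memcached-slave", 11211))

def cache_get(key):
    """Read from the slave first; on a miss, penetrate to the master
    and write the value back so the slave's data stays hot."""
    value = slave.get(key)
    if value is None:
        value = master.get(key)
        if value is not None:
            slave.set(key, value)  # backfill keeps the slave warm
    return value

def cache_set(key, value, expire=300):
    # Writes are applied to both groups to keep them in sync
    master.set(key, value, expire=expire)
    slave.set(key, value, expire=expire)
```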

img

3. Multiple Replicas

In fact, the master-slave mode can already solve most of the problems, but in scenarios with extreme traffic, a group of slaves usually cannot fully handle all the traffic, and the network card bandwidth of the slaves may become a bottleneck.

To solve this problem, we consider adding a replica layer before the master/slave layer. The overall architecture is as follows:

img

In this solution, when a client issues a query, it first picks one of the multiple replica groups and queries it. If that query misses, it falls back to the master/slave layer, and the result is then written to all replica groups so that none of them is left holding stale data.

For cost reasons, each replica group has a smaller capacity than the master and slave, so it stores only the hottest data. In this architecture, the traffic reaching the master and slave drops sharply. To keep their data hot, in practice we treat the master/slave pair itself as one of the replica groups.
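A minimal sketch of this lookup flow, building on the `cache_get` helper from the previous sketch (the replica-group clients and the expiry value are assumptions):

```python
import random

def replica_get(key, replica_groups, expire=300):
    """Query one randomly chosen replica group; on a miss, fall back to
    the master/slave layer and backfill every replica group."""
    group = random.choice(replica_groups)  # spread reads across groups
    value = group.get(key)
    if value is None:
        value = cache_get(key)  # master/slave fallback from the sketch above
        if value is not None:
            # Write the result to all groups so no group holds stale data
            for g in replica_groups:
                g.set(key, value, expire=expire)
    return value
```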

Middle Proxy Layer Solution #

While the client-side solution can solve most of the problems, it can only be reused between systems of the same language. For example, if Weibo implemented the logic using Java, it would be cumbersome for me to reuse it using PHP. The middle proxy layer solution can solve this problem. You can transfer the experience of the client-side solution to the proxy layer and achieve reuse in other languages through a common protocol like the Redis protocol.

If you develop your own caching proxy layer, you can encapsulate the high availability logic from the client-side solution inside the proxy code. Users of your proxy layer then don’t need to worry about how the cache’s high availability is achieved; they only need to depend on your proxy layer.

In addition, the industry also has many middle proxy layer solutions, such as Facebook’s Mcrouter, Twitter’s Twemproxy, and Wandoujia’s Codis. Their principles can be summarized by the following diagram:

img

Do you notice anything in this diagram? All cache read and write requests pass through the proxy layer. The proxy layer is stateless and mainly responsible for routing read and write requests, with high availability logic built in; different open-source proxy solutions use different high availability strategies. In Twemproxy, for example, if a Redis node fails, the proxy removes it from the cluster and routes subsequent requests to the remaining nodes. Codis is slightly more complex: it provides a tool called Codis HA to automatically promote a slave node to master, and starting from version 3.2 it switched to Redis Sentinel mode to achieve high availability of the Redis nodes.
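Because the proxy speaks the same protocol as the cache itself, client code does not change at all. A sketch with redis-py, pointing at a hypothetical Twemproxy listen address:

```python
import redis

# "proxy-host:22121" is a hypothetical Twemproxy listen address; to the
# client it is indistinguishable from a single Redis server.
client = redis.Redis(host="proxy-host", port=22121)
client.set("user:42:name", "alice")
print(client.get("user:42:name"))  # b"alice"
```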

Server-side Solution #

In version 2.4, Redis introduced Redis Sentinel to address the high availability of Redis master-slave deployments. When the master node fails, Sentinel automatically promotes a slave node to master, keeping the cluster available as a whole. The overall architecture is shown in the diagram below:

img

Redis Sentinel is itself deployed as a cluster, so that the failure of a single Sentinel node does not break automatic failure recovery. Each Sentinel node is stateless. The master’s address is configured in Sentinel, and the Sentinel nodes continually monitor the master’s status; when the master fails to respond within the configured time interval, they consider it dead. The Sentinels then elect one of the slave nodes as the new master and repoint all other slaves to it. During this arbitration, a configured quorum value determines how many Sentinel nodes must agree that the master has failed before the switchover to the new master proceeds.

Redis Sentinel does not belong to the proxy layer mode, as the write and read requests for the cache do not pass through the Sentinel nodes. The Sentinel nodes are at the same level as the master-slave setup and exist as administrators. Therefore, it can be considered as a high availability solution provided on the server side.
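As a sketch of how a client discovers the current master through Sentinel (using redis-py; the Sentinel addresses and the service name "mymaster" are assumptions that would come from your sentinel.conf):

```python
from redis.sentinel import Sentinel

# Hypothetical Sentinel nodes; "mymaster" is the monitored master's name
sentinel = Sentinel(
    [("sentinel-1", 26379), ("sentinel-2", 26379), ("sentinel-3", 26379)],
    socket_timeout=0.5,
)

# The client asks Sentinel for the current master/slave addresses and then
# talks to those nodes directly; data traffic never flows through Sentinel
master = sentinel.master_for("mymaster", socket_timeout=0.5)
replica = sentinel.slave_for("mymaster", socket_timeout=0.5)

master.set("product:1001:stock", 42)       # writes go to the master
print(replica.get("product:1001:stock"))   # reads can be served by a slave
```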

Course Summary #

That’s all for today’s sharing. Let’s review the key points together:

There are three main high availability solutions for distributed caching. The first one is the client-side solution, commonly known as the Smart Client. By defining data sharding and data read-write strategies, we can achieve cache high availability. The advantage of this solution is that there is no performance loss, but the disadvantage is that the client logic is complex and cannot be reused in a multi-language environment.

Second, the intermediary proxy solution adds a middle layer between the client and the cache nodes, which may cost some performance. The proxy layer has high availability strategies built in, such as Codis using Codis HA or Sentinel.

Finally, the server-side solution depends on what the storage component itself provides. Memcached only supports single-machine deployment and has no built-in distributed or HA solution, while Redis has provided the Sentinel solution since version 2.4, which performs master-slave failover automatically. The server-side solution does add some operations and maintenance complexity.

Overall, the three solutions of distributed caching each have their own advantages. Some teams may have accumulated experience on the Smart Client during the development process; some teams may have rich experience in Redis operation and maintenance, so they can promote the Sentinel solution; some teams may have accumulated some storage development experience, so they can promote the middle proxy solution or even develop their own proxy components suitable for their business scenarios. The specific choice still depends on the actual situation of the team.