
16 Local Cache: What Pitfalls Might Local Caching Encounter When Used as a Service? #

Hello, I’m Xu Changlong.

In this chapter, we will learn how to deal with systems under heavy read and write load. Weibo feeds, online games, IM, online classrooms, and live streaming all fall into this category. The technologies behind these systems are considered among the most advanced in the industry, because even a small production issue can noticeably hurt the user experience.

When it comes to systems with high read and write loads, we cannot avoid caching, because at present only caching can serve data at such traffic volumes. Common caching architectures typically use a centralized cache to serve requests.

However, centralized caching has its limits in these systems. Once traffic reaches a certain level, the network overhead between the stateless services and the centralized cache grows increasingly severe, making the cache both expensive and unstable under high-concurrency reads and writes.

To reduce costs and save resources, we add another layer of caching at the business service layer, sacrificing strong consistency and maintaining eventual consistency. This helps to reduce the read and write pressure on the core caching layer.

Virtual Memory and Page Fault Interrupt #

In order to implement a good business layer cache, we need to first understand how the operating system manages memory at a low level.

Consider the following C++ code and take a moment to think about what the memory address of the variable will be if the program is run multiple times under unchanged conditions:

#include <stdio.h>
#include <unistd.h>

int testvar = 0;
int main(int argc, char const *argv[])
{
  testvar += 1;
  sleep(10);
  printf("address: %p, value: %d\n", (void *)&testvar, testvar);
  return 0;
}

The answer may surprise you. If you try it, you will find that the printed address of the variable is always the same, which shows that the memory the program sees is isolated. If our service were accessing physical memory directly, this would not happen.

Why is the result like this? This brings us to how Linux manages memory: it uses virtual memory, so every running process has its own virtual address space.

Coming back to our topic: when we provide a cache data service and want it to serve concurrent reads and writes efficiently, we need to keep the data in local memory, usually shared by multiple threads within one process. In this setup, however, we will also run into page faults. Let's look at those next.

Image

As shown in the figure above, the memory our service allocates on Linux is not immediately backed by physical memory. Only when the program first writes to that memory does the system discover that no physical page has been assigned; the CPU then raises a page fault interrupt, and the operating system allocates physical memory to the program in units of pages. The system is designed this way mainly to reduce memory fragmentation and minimize waste.

However, the pages the system allocates are small, typically 4 KB. If we need to insert 1 GB of data into memory at once, writing into that memory will trigger page fault interrupts over and over (on the order of 260,000 faults at 4 KB per page), slowing the program down and making the service unstable.

Therefore, when we know we will need high-concurrency reads and writes against a block of memory, we first allocate the whole block and fill it with zeros before using it. This greatly reduces the number of page faults triggered while data is being inserted. One additional note: allocating and zero-filling a large block is itself slow, so it is best done when the service starts.
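To make this concrete, here is a minimal Go sketch (my own illustration, not code from the lesson) of pre-faulting a large block at startup: allocate the whole buffer, then touch one byte per page so the page faults happen once, up front. The 1 GB size is an arbitrary example, and the page size is taken from os.Getpagesize() rather than hard-coded.

package main

import (
    "fmt"
    "os"
    "time"
)

// prefault allocates size bytes and writes one byte per page so the
// kernel maps physical pages now, at startup, instead of on first use.
func prefault(size int) []byte {
    buf := make([]byte, size) // Go zero-fills, but pages are still mapped lazily
    page := os.Getpagesize()  // usually 4096
    for i := 0; i < len(buf); i += page {
        buf[i] = 0 // touching the page triggers its page fault once
    }
    return buf
}

func main() {
    start := time.Now()
    cache := prefault(1 << 30) // reserve 1 GB for the in-process cache
    fmt.Printf("prefaulted %d bytes in %v\n", len(cache), time.Since(start))
    // later writes into `cache` hit pages that are already mapped
}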

Although the measures above take effect immediately, problems can still appear when resources are tight. In practice, many services allocate several GB of memory as soon as they start, yet actively use less than 10% of it at runtime. Based on access statistics, Linux will move data that has not been touched for a long time out of memory to free space for other, more active memory. This operation is called swap-out.

To reduce the probability of swap-out, we need to give the in-memory caching service enough memory and system resources, ideally running it in a relatively dedicated environment so it can serve traffic reliably.

But memory is limited, so we must plan carefully how much data lives in memory and make sure that data is actually accessed frequently. We also need to cap how much memory the cache consumes, because when system resources run short, the processes consuming the most memory are the first to be killed. To keep resources from being wasted, we use the LRU (Least Recently Used) algorithm to evict data that is rarely accessed.
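As a rough illustration of the eviction idea, here is a minimal, non-thread-safe LRU sketch in Go built on container/list; a production version would add locking, size accounting in bytes, and expiry.

package cache

import "container/list"

// entry is stored in the list so eviction can also delete the map key.
type entry struct {
    key   string
    value []byte
}

// LRU keeps at most cap entries and evicts the least recently used one.
type LRU struct {
    cap   int
    ll    *list.List               // front = most recently used
    items map[string]*list.Element // key -> list node
}

func NewLRU(capacity int) *LRU {
    return &LRU{cap: capacity, ll: list.New(), items: make(map[string]*list.Element)}
}

func (c *LRU) Get(key string) ([]byte, bool) {
    if el, ok := c.items[key]; ok {
        c.ll.MoveToFront(el) // mark as recently used
        return el.Value.(*entry).value, true
    }
    return nil, false
}

func (c *LRU) Put(key string, value []byte) {
    if el, ok := c.items[key]; ok {
        c.ll.MoveToFront(el)
        el.Value.(*entry).value = value
        return
    }
    c.items[key] = c.ll.PushFront(&entry{key: key, value: value})
    if c.ll.Len() > c.cap {
        oldest := c.ll.Back()
        c.ll.Remove(oldest)
        delete(c.items, oldest.Value.(*entry).key)
    }
}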

Even with these measures there are still gaps, because business behavior is hard to predict. It is therefore advisable to periodically scan and warm up the memory, so that a sudden traffic spike does not trigger a flood of page faults that stalls the service and eventually brings it down.

Lock Granularity of In-Process Containers #

Besides keeping cold data out of memory, we also need to lock the shared data held in memory. Without mutual exclusion, concurrent modifications by multiple threads can leave the data inconsistent.

When reads and writes are frequent, we usually attach a lock to the corresponding struct, either per record or per map. Be aware, though, that if the lock granularity is too coarse, it will hurt the service's performance.

Because reality often differs from expectations, I recommend running several local stress tests before settling on an approach. For example, when writing in-memory services in C++11 I have run into cases where read-write locks performed worse than spin locks or plain mutexes, and cases where compressed transfers were less efficient than uncompressed ones.

Now let’s take a look at the common locking methods for business caches.

Image

To reduce lock contention, the approach I use most often is to split a large, frequently modified map into 256 or even more shards. Each shard has its own mutex, which reduces contention and improves concurrent read/write throughput.
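A minimal Go sketch of this sharding idea is shown below; the shard count of 256 and the FNV hash are assumptions for the example, and each shard guards its own map with its own lock.

package cache

import (
    "hash/fnv"
    "sync"
)

const shardCount = 256 // power of two so we can mask instead of mod

type shard struct {
    mu   sync.RWMutex
    data map[string][]byte
}

// ShardedMap splits one big map into 256 independently locked shards,
// so writers on different shards never contend with each other.
type ShardedMap struct {
    shards [shardCount]shard
}

func NewShardedMap() *ShardedMap {
    m := &ShardedMap{}
    for i := range m.shards {
        m.shards[i].data = make(map[string][]byte)
    }
    return m
}

func (m *ShardedMap) shardFor(key string) *shard {
    h := fnv.New32a()
    h.Write([]byte(key))
    return &m.shards[h.Sum32()&(shardCount-1)]
}

func (m *ShardedMap) Get(key string) ([]byte, bool) {
    s := m.shardFor(key)
    s.mu.RLock()
    defer s.mu.RUnlock()
    v, ok := s.data[key]
    return v, ok
}

func (m *ShardedMap) Set(key string, value []byte) {
    s := m.shardFor(key)
    s.mu.Lock()
    defer s.mu.Unlock()
    s.data[key] = value
}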

Another approach is to funnel all modifications and reads through a single thread, which avoids lock contention and improves execution efficiency; Redis, which we use frequently, is built on this idea. The following diagram shows the basic concept:

Image
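To make the pattern concrete, here is a minimal Go sketch of the single-writer idea, using a channel to funnel every read and write through one goroutine; it illustrates the pattern only and is not how Redis itself is implemented.

package cache

// command is either a read (reply != nil) or a write against the private map.
type command struct {
    key   string
    value []byte
    write bool
    reply chan []byte
}

// SingleWriter owns its map exclusively; only the inner goroutine touches it,
// so no mutex is needed, similar in spirit to Redis's single-threaded core.
type SingleWriter struct {
    cmds chan command
}

func NewSingleWriter() *SingleWriter {
    s := &SingleWriter{cmds: make(chan command, 1024)}
    go func() {
        data := make(map[string][]byte) // visible only to this goroutine
        for c := range s.cmds {
            if c.write {
                data[c.key] = c.value
            } else {
                c.reply <- data[c.key]
            }
        }
    }()
    return s
}

func (s *SingleWriter) Set(key string, value []byte) {
    s.cmds <- command{key: key, value: value, write: true}
}

func (s *SingleWriter) Get(key string) []byte {
    reply := make(chan []byte, 1)
    s.cmds <- command{key: key, reply: reply}
    return <-reply
}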

If we can accept a full update every half hour or hour, we can maintain two maps and refresh the data by replacement.

Concretely, we keep two pointers to two maps. One map serves external requests; when an offline update package arrives, the other pointer's map is loaded with the full offline data. Once loading completes, the two pointers are swapped, achieving a batch update of the data. This improves performance without adding a mutex (see the sketch after the diagrams below).

Image

Image
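Below is a minimal Go sketch of this double-buffer swap, assuming Go 1.19+ for atomic.Pointer; the offline snapshot is passed in as an ordinary map for simplicity.

package cache

import "sync/atomic"

// DoubleBuffer serves reads from one map while a full offline data set
// is loaded into another; a single pointer swap publishes the new map.
type DoubleBuffer struct {
    active atomic.Pointer[map[string][]byte]
}

func NewDoubleBuffer() *DoubleBuffer {
    d := &DoubleBuffer{}
    empty := make(map[string][]byte)
    d.active.Store(&empty)
    return d
}

// Get reads from the currently published map; no lock is taken.
func (d *DoubleBuffer) Get(key string) ([]byte, bool) {
    m := *d.active.Load()
    v, ok := m[key]
    return v, ok
}

// Reload builds a fresh map from the offline snapshot, then swaps pointers.
// Readers still holding the old map keep using it until the GC reclaims it.
func (d *DoubleBuffer) Reload(snapshot map[string][]byte) {
    fresh := make(map[string][]byte, len(snapshot))
    for k, v := range snapshot {
        fresh[k] = v
    }
    d.active.Store(&fresh)
}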

There are also lock-free techniques in the industry that reduce lock contention, such as atomic operations, Go's sync.Map and sync.Pool, and Java's volatile. If you are interested, look up the material for the language you use. You can also study the lock-free aspects of MySQL InnoDB's MVCC implementation.

GC and Data Usage #

When using a container like a map as a cache, is it ideal to put our data structs directly into it? In fact, I do not recommend it. This answer may challenge your assumptions, but the analysis below will make the reason clear.

When we put tens of thousands of objects, or more, into the cache, the language's garbage collector (GC) periodically scans these objects to decide whether they can be reclaimed. The more objects the map holds, the slower GC becomes.

This is why many languages add special optimizations to make it feasible to keep business cache data in memory, and it is also why services written in high-level languages rarely put data objects directly into one large map when serving as caches.

Take Go as an example. To reduce the number of objects scanned, Go marks maps specially: if a map contains no pointers, the GC will not traverse the objects stored in it.

To make this easy to understand, here is an example. Instead of storing the actual object data in the map, we can use a simple structure as a lookup index, for instance map[int]int, where the key is the string key converted to an int by a hash function and the value stores the offset and length of the serialized data.

The serialized data itself is appended to one long byte array, and that is what we cache. The drawback is that deleting or modifying data in place is difficult, so usually only the index record in the map is deleted.

Image
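Here is a minimal Go sketch of this index-plus-byte-array layout; the FNV hash and the [2]int {offset, length} encoding are choices made for the illustration. Because neither the keys nor the values in the index map contain pointers, the GC does not traverse its contents.

package cache

import "hash/fnv"

// Arena appends serialized records to one big []byte and keeps only an
// offset/length index in a pointer-free map, so the GC has almost nothing
// to scan no matter how many records we cache.
type Arena struct {
    buf   []byte
    index map[uint64][2]int // hashed key -> {offset, length}
}

func NewArena(capacity int) *Arena {
    return &Arena{
        buf:   make([]byte, 0, capacity),
        index: make(map[uint64][2]int),
    }
}

func hashKey(key string) uint64 {
    h := fnv.New64a()
    h.Write([]byte(key))
    return h.Sum64()
}

// Put appends the serialized value; "deleting" later just drops the index entry.
func (a *Arena) Put(key string, serialized []byte) {
    offset := len(a.buf)
    a.buf = append(a.buf, serialized...)
    a.index[hashKey(key)] = [2]int{offset, len(serialized)}
}

func (a *Arena) Get(key string) ([]byte, bool) {
    loc, ok := a.index[hashKey(key)]
    if !ok {
        return nil, false
    }
    return a.buf[loc[0] : loc[0]+loc[1]], true
}

// Delete removes only the index record; the bytes stay until a rebuild.
func (a *Arena) Delete(key string) {
    delete(a.index, hashKey(key))
}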

This also means we should handle caching according to the characteristics of the cached data.

If the data volume is small and the workload is read-and-write intensive (that is, the data changes frequently), it is more reasonable to put the structs into a map and serve them from there. If the data volume is large and the workload is read-heavy and write-light, it is more appropriate to store the data in contiguous memory and access it by offset and length.

Having analyzed the GC issue, you can see why services written in high-level languages would rather put the data into a shared basic service than cache it locally.

If you still want to cache locally, I recommend an interesting project called XMM for reference. It is a Golang memory management component that sidesteps the GC. Similar components exist in other languages as well; you can explore them on your own.

Memory Alignment #

As mentioned earlier, when data is placed in one large block of virtually contiguous memory and accessed by offset and length, there are further improvements we can make.

Before discussing the optimization, we need to understand memory alignment. Many languages pay close attention to it because aligned memory brings real benefits: for example, if every element in an array has the same length, we can locate any element instantly.

Let’s take an example. If I want to quickly find the 6th object in an array, I can achieve that using the following formula:

sizeof(obj) * index => offset

For this formula to work, the struct must have a fixed length, and that length should be aligned (commonly padded up to a power of two). Variable-length fields can instead hold a pointer to a separate memory region.

Image

With this approach, we can locate an object in memory directly from its index; because the length is fixed, we do not need to record it, and the index alone is enough to find the data.
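The sketch below applies the formula to a flat byte slice in Go; the 64-byte record size and the little-endian field encoding are assumptions for the example.

package cache

import "encoding/binary"

const recordSize = 64 // every record occupies one fixed, aligned slot

// FixedTable stores records back to back; the i-th record always starts
// at i*recordSize, so lookup is pure arithmetic, with no per-record metadata.
type FixedTable struct {
    buf []byte
}

func NewFixedTable(records int) *FixedTable {
    return &FixedTable{buf: make([]byte, records*recordSize)}
}

// PutUint64At writes an 8-byte field at a given field offset inside slot i.
func (t *FixedTable) PutUint64At(i int, fieldOffset int, v uint64) {
    base := i * recordSize // sizeof(obj) * index => offset
    binary.LittleEndian.PutUint64(t.buf[base+fieldOffset:], v)
}

// Uint64At reads the same 8-byte field back from slot i.
func (t *FixedTable) Uint64At(i int, fieldOffset int) uint64 {
    base := i * recordSize
    return binary.LittleEndian.Uint64(t.buf[base+fieldOffset:])
}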

This design means that when we fetch a piece of data, we can pull in the whole memory page that holds it and then search the indexed data efficiently within that page, rather than reading multiple pages. After all, from the CPU's point of view, accessing main memory is also relatively slow, so the fewer accesses the better. Accessing memory page by page is not only fast, it also improves the hit rate of the CPU's L1 and L2 caches.

SLAB Memory Management #

In addition to the above methods, you may be curious about how basic memory services manage memory. Let’s take a look at the following design.

Image

As shown in the figure above, to reduce memory fragmentation and improve allocation efficiency, most mainstream runtimes implement memory management similar to the slab allocation used by Memcached, and some memory management libraries for high-level languages work the same way.

Let me give you an example. Redis can be built with jemalloc to reduce memory fragmentation. Let's look at how jemalloc works.

jemalloc allocates a large block of memory at once and divides it into groups. To cover different allocation sizes, each group is split into chunks of one fixed size, and the chunk size grows from group to group: for example, the first group holds 32-byte chunks and the second 64-byte chunks.

When data needs to be stored, jemalloc searches the free-chunk list of the matching size and hands a chunk to the caller. If no free chunk of the same size is available, a chunk from a larger class is used instead. This wastes a little memory, but it greatly reduces fragmentation and improves memory utilization.
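As a toy illustration of the size-class idea, the Go sketch below rounds each request up to a fixed class (32 B, 64 B, 128 B, ...) and reuses freed chunks per class via sync.Pool; real allocators such as jemalloc are far more elaborate.

package cache

import "sync"

// classSizes mirrors the "groups" described above: each class hands out
// chunks of one fixed size, and a request is rounded up to the next class.
var classSizes = []int{32, 64, 128, 256, 512, 1024}

// pools holds one free list per size class; sync.Pool plays the role of
// the slab's free-chunk list in this toy example.
var pools = func() []*sync.Pool {
    ps := make([]*sync.Pool, len(classSizes))
    for i, size := range classSizes {
        size := size // capture per iteration
        ps[i] = &sync.Pool{New: func() any { return make([]byte, size) }}
    }
    return ps
}()

// Alloc returns a chunk at least n bytes long, rounded up to its size class.
func Alloc(n int) []byte {
    for i, size := range classSizes {
        if n <= size {
            return pools[i].Get().([]byte)[:n]
        }
    }
    return make([]byte, n) // larger than the biggest class: fall back to the heap
}

// Free returns a chunk to its class so later allocations can reuse it.
func Free(b []byte) {
    c := cap(b)
    for i, size := range classSizes {
        if c == size {
            pools[i].Put(b[:size])
            return
        }
    }
}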

Many high-level languages use a similar approach: when local memory is insufficient, the runtime grabs another large block from the OS and keeps serving from it. This also means that, unless the service restarts, the runtime will not necessarily return memory to the OS even after our business code releases a temporarily allocated block. So before making a large temporary allocation, consider carefully whether it is worth it.

Summary #

After completing this lesson, you should understand why the industry generally tries to avoid using caches inside business services to handle high-concurrency reads and writes.

When implementing this kind of service, we must not only make sure it can handle highly concurrent network requests, but also reduce lock contention between internal reads and writes, and account for the GC behavior of high-level languages, memory fragmentation, page faults, and so on. On top of that, we have to worry about data updates, consistency, and keeping the cached memory warm.

Image

Even in the special cases where we do use business-layer caching, once the business stabilizes almost everyone tries to retire this kind of service and turn it into a purely read-heavy/write-light or write-heavy/read-light one.

A more common compromise is that, if we really have to do it, we run a small Redis shard on the business server to absorb the online pressure. Of course, we then also need to consider how to keep its data in sync.

In addition to the pitfalls we discussed today, there are also other issues that we may encounter in the process of providing external services through memory, which we will delve into in the next class.

Thought Question #

When using a large array to store data and implementing data caching using offset + length, how can we modify the data?

Feel free to discuss and exchange ideas with me in the comments section. See you in the next class!