20 Data Engines: Unified Caching Data Platform #

Hello, I am Xu Changlong.

In the previous four chapters, we learned how to optimize different types of systems and identified their key technical points. Beyond that foundation, it is also important to understand how large-scale Internet companies design and support high-concurrency systems. So in this chapter, I have selected a few cases to broaden your horizons and explore practical designs for intranet services.

Every internet company has a few core, profitable businesses. We often build value-added services around them to expand our service scope and form an industry chain and ecosystem. However, these value-added services need the data and interactions of the core projects in order to serve users well.

If the core system is tightly coupled to and specially adapted for each value-added business system, the business systems become very complex. How can we let value-added services access the core system's resources while reducing the coupling between the systems?

In this lesson, I will focus on a middleware that supports active caching on the intranet. With this middleware, high-performance entity-data access and cache updates become easy to implement.

Review of Temporary Cache Implementation #

Let’s review the temporary cache implementation shown earlier, which is excerpted from the second lesson.

// Try to retrieve the user information directly from the cache
userinfo, err := Redis.Get("user_info_9527")
if err != nil {
  return nil, err
}

// Cache hit: return the user information directly
if userinfo != nil {
  return userinfo, nil
}

// Cache miss: retrieve the data from the database
// (plain `=` here, since userinfo and err are already declared above;
// using `:=` again would not compile)
userinfo, err = userInfoModel.GetUserInfoById(9527)
if err != nil {
  return nil, err
}

// User information found
if userinfo != nil {
  // Cache the user information with a 60-second TTL
  Redis.Set("user_info_9527", userinfo, 60)
  return userinfo, nil
}

// Not found: cache an empty value for a short period so repeated lookups
// don't hit the database (optional; this mitigates cache-penetration attacks)
Redis.Set("user_info_9527", "", 30)
return nil, nil

The above code demonstrates a common way to improve read performance with a temporary cache: look up the user information in the cache by ID, fall back to the database on a miss, and write the result back to the cache for subsequent queries.

Although this implementation is relatively simple, it can be a labor-intensive task if we have to write this kind of code for all our business logic.

Even if we encapsulate this kind of implementation, the encapsulated functionality is not very generic in statically typed languages and may not perform well. So, is there a way to solve this kind of problem uniformly and reduce our workload?

Active Caching for Entity Data #

In a previous lesson, we discussed that entity data is the easiest to cache, and that the cache key for entity data can be designed as prefix + primary key ID. With this design, as long as we have an entity's ID, we can retrieve the entity data directly from the cache.
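As a minimal illustration, building such a key in Go might look like this (the `user_info` prefix matches the earlier example; the helper name is hypothetical):

import "fmt"

// entityKey joins a table-name prefix and a primary key ID,
// e.g. entityKey("user_info", 9527) returns "user_info_9527".
func entityKey(table string, id int64) string {
  return fmt.Sprintf("%s_%d", table, id)
}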

To reduce duplicate work, we can extract this process into a middleware, as shown in the diagram below:

[Figure: Canal subscribes to the MySQL binlog; a message listener receives change notifications and the middleware pushes fresh entity data into the cache]

Combining with the diagram above, let's analyze how this middleware works. We use Canal to monitor the binlog of the MySQL database; when data changes, the message listener receives a change notification.

Because the change message contains the table name and the primary key IDs of all changed rows, we can use those IDs to query the latest entity data from the database master, process the data as needed, and then push it to the cache, as sketched below.
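The ChangeEvent shape and the Store/Cache interfaces in this sketch are assumptions standing in for the real Canal message and the database/Redis clients:

package cachesync

import (
  "context"
  "encoding/json"
  "fmt"
  "time"
)

// ChangeEvent is a simplified stand-in for the message produced from a binlog entry.
type ChangeEvent struct {
  Table string  // table that changed, e.g. "user_info"
  IDs   []int64 // primary key IDs of the changed rows
}

// Store and Cache are minimal stand-ins for the database master and the cache.
type Store interface {
  GetByID(ctx context.Context, table string, id int64) (map[string]any, error)
}
type Cache interface {
  Set(ctx context.Context, key string, value []byte, ttl time.Duration) error
}

// OnChange re-reads every changed row from the master, serializes it, and
// pushes the fresh copy into the cache with a TTL, as described above.
func OnChange(ctx context.Context, ev ChangeEvent, db Store, cache Cache, ttl time.Duration) error {
  for _, id := range ev.IDs {
    row, err := db.GetByID(ctx, ev.Table, id)
    if err != nil {
      return err
    }
    payload, err := json.Marshal(row)
    if err != nil {
      return err
    }
    key := fmt.Sprintf("%s_%d", ev.Table, id) // same prefix+ID key scheme as above
    if err := cache.Set(ctx, key, payload, ttl); err != nil {
      return err
    }
  }
  return nil
}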

Experience shows that newly changed data has a high probability of being read immediately, so this implementation achieves a good cache hit rate. At the same time, cached data is given a TTL according to the configuration, and data that has not been read for a while is evicted by the LRU strategy, which saves cache space.

If you think carefully, you will find that this design still has a flaw: when the business system cannot find the data it needs in the cache, it still has to query the database and write the data back to the cache itself, which contradicts our original intention. Therefore, we also need a cache query service, as shown in the diagram below:

[Figure: the cache query service fronts the cache; on a miss it identifies the table and processing script from the key, queries the database, backfills the cache, and returns the result]

As shown in the diagram above, when a cache lookup misses, the middleware identifies from the key which table the data belongs to and which processing script to execute. It then runs the script according to its configuration to query the database and process the data, after which the middleware backfills the cache with the result and returns it.
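A minimal sketch of that dispatch-on-miss flow follows; the loader registry, the key layout, and the Cache interface are illustrative assumptions:

package cachequery

import (
  "context"
  "errors"
  "strings"
  "time"
)

// Cache is a minimal stand-in for the cache client.
type Cache interface {
  Get(ctx context.Context, key string) ([]byte, error)
  Set(ctx context.Context, key string, value []byte, ttl time.Duration) error
}

// Loader is a per-table handler ("processing script") that fetches and shapes one entity.
type Loader func(ctx context.Context, id string) ([]byte, error)

// loaders maps a key prefix (table name) to its registered handler.
var loaders = map[string]Loader{}

// Get tries the cache first; on a miss it selects the loader by key prefix,
// queries the database through it, backfills the cache, and returns the result.
func Get(ctx context.Context, cache Cache, key string, ttl time.Duration) ([]byte, error) {
  if v, err := cache.Get(ctx, key); err == nil && v != nil {
    return v, nil
  }
  // Keys are assumed to look like "<table>_<id>", e.g. "user_info_9527".
  i := strings.LastIndexByte(key, '_')
  if i < 0 {
    return nil, errors.New("malformed cache key: " + key)
  }
  table, id := key[:i], key[i+1:]
  loader, ok := loaders[table]
  if !ok {
    return nil, errors.New("no loader registered for table: " + table)
  }
  v, err := loader(ctx, id)
  if err != nil {
    return nil, err
  }
  _ = cache.Set(ctx, key, v, ttl) // backfill; a failed Set only costs a future miss
  return v, nil
}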

To improve query efficiency, it is recommended that the query service use a plain-text, long-lived connection protocol similar to Redis's, and that it support batch retrieval commands such as Redis's mget. If the data-support architecture is complex and a single query touches a large amount of data, batched concurrent processing can further improve system throughput.
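For the batch path, here is a sketch of bounded concurrent fetching with golang.org/x/sync/errgroup; the 16-way limit is an arbitrary example, and fetch stands in for a single-key lookup such as the Get function above:

import (
  "context"

  "golang.org/x/sync/errgroup"
)

// MGet fetches many keys concurrently through fetch, preserving input order.
// The concurrency cap keeps a huge batch from overwhelming the backends.
func MGet(ctx context.Context, keys []string, fetch func(ctx context.Context, key string) ([]byte, error)) ([][]byte, error) {
  results := make([][]byte, len(keys))
  g, ctx := errgroup.WithContext(ctx)
  g.SetLimit(16) // tune to backend capacity
  for i, key := range keys {
    i, key := i, key // capture loop variables (needed before Go 1.22)
    g.Go(func() error {
      v, err := fetch(ctx, key)
      if err != nil {
        return err
      }
      results[i] = v
      return nil
    })
  }
  if err := g.Wait(); err != nil {
    return nil, err
  }
  return results, nil
}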

There are some practical techniques for implementing cache services, let’s take a look together.

If the requested data does not exist at all, every query will miss the cache, causing cache penetration, and the database may crash under the resulting volume of requests. To prevent this, we add a special flag in the cache so that once the query service confirms the data does not exist, later queries can directly return "not found".

We also need to limit concurrent access to the database when the cache is penetrated. Rather than a global lock, it is recommended to merge identical in-flight requests with SingleFlight, scoped to each service instance.
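Below is a sketch combining the two safeguards, using golang.org/x/sync/singleflight to merge concurrent identical requests; the notFound marker value, the TTLs, and the Cache interface (same shape as in the earlier sketch) are assumptions:

import (
  "context"
  "time"

  "golang.org/x/sync/singleflight"
)

var group singleflight.Group

// notFound marks keys with no backing row, so repeat lookups
// return "does not exist" without touching the database.
const notFound = "\x00nil\x00"

// GetOnce serves a key from the cache; concurrent misses for the same key
// share a single load call, and confirmed misses are cached briefly.
func GetOnce(ctx context.Context, cache Cache, key string, load func() ([]byte, error)) ([]byte, error) {
  if v, err := cache.Get(ctx, key); err == nil && v != nil {
    if string(v) == notFound {
      return nil, nil // known-missing entity: short-circuit
    }
    return v, nil
  }
  // All concurrent misses for the same key share one database query.
  v, err, _ := group.Do(key, func() (interface{}, error) {
    data, err := load()
    if err != nil {
      return nil, err
    }
    if data == nil {
      // Cache the miss for a short period to blunt penetration attacks.
      _ = cache.Set(ctx, key, []byte(notFound), 30*time.Second)
      return []byte(nil), nil
    }
    _ = cache.Set(ctx, key, data, 60*time.Second)
    return data, nil
  })
  if err != nil {
    return nil, err
  }
  return v.([]byte), nil
}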

Sometimes the data we need is spread across multiple tables in the database, and we have to combine data from several tables or refresh several caches at once. Our cache service therefore needs to support custom scripts for refreshing business data.

In addition, since this is synchronization between the database and the cache, it is recommended to record the data's last update time in both places, so that the two copies can be compared when troubleshooting cache-synchronization issues.
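One simple way to record that timestamp on the cache side (a sketch assuming JSON-encoded cache values) is to wrap each payload in an envelope:

import (
  "encoding/json"
  "time"
)

// envelope carries the source row's last update time alongside the payload,
// so a cached copy can be compared against the database row when debugging.
type envelope struct {
  UpdatedAt time.Time       `json:"updated_at"`
  Data      json.RawMessage `json:"data"`
}

func wrap(updatedAt time.Time, data []byte) ([]byte, error) {
  return json.Marshal(envelope{UpdatedAt: updatedAt, Data: data})
}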

With this, our service is basically complete. When the business needs to search for data by ID, we can directly call the data middleware to obtain the latest data, without having to implement it repeatedly, making the development process much simpler.

L1 Cache and Hotspot Cache Renewal #

The cache middleware we designed above is already capable of handling most scenarios that require temporary caching. However, in the case of high concurrency queries, if the cache is missing or expired, it will put a lot of pressure on the database. Therefore, further improvements to this service are needed.

The improvement is to count queries and determine whether a queried key is a hotspot. For example, we can asynchronously count how many times each cache key is accessed within a 5-minute time block; if the count per unit of time exceeds a threshold (set according to business requirements), the key is treated as a hotspot cache, as sketched below.
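This sketch of the time-block counter assumes a Redis-style INCR/EXPIRE backend; the key format and the two-block expiry are illustrative choices:

import (
  "context"
  "fmt"
  "time"
)

// Counter is a minimal stand-in for Redis INCR/EXPIRE.
type Counter interface {
  Incr(ctx context.Context, key string) (int64, error)
  Expire(ctx context.Context, key string, ttl time.Duration) error
}

const block = 5 * time.Minute

// Touch bumps the access count for key in the current 5-minute time block
// and reports whether it has crossed the hotspot threshold.
func Touch(ctx context.Context, c Counter, key string, threshold int64) (bool, error) {
  bucket := time.Now().Unix() / int64(block.Seconds())
  counterKey := fmt.Sprintf("hot_%s_%d", key, bucket)
  n, err := c.Incr(ctx, counterKey)
  if err != nil {
    return false, err
  }
  if n == 1 {
    // First hit in this block: let the counter expire after two blocks.
    _ = c.Expire(ctx, counterKey, 2*block)
  }
  return n >= threshold, nil
}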

The specific process for hotspot cache statistics and renewal is shown in the following diagram:

Comparing with the process diagram, you can see that after the hotspot statistics service identifies a key as a hotspot, it differentiates keys by access frequency. Keys with very high frequency are periodically pushed by a script into the L1 cache (the L1 cache can be deployed on each business server, or several business servers can share one L1 cache).

When the business queries data, the query SDK checks the hotspot-key configuration to determine whether the current key is a hotspot. If it is, the SDK reads the data from the L1 cache; otherwise it reads from the cluster cache.
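Inside the SDK, that routing decision might look like this minimal sketch; hotKeys and the two cache handles are assumptions, with Cache shaped as in the earlier sketches:

import "context"

// lookup sends reads for hotspot keys to the local L1 cache and
// everything else to the shared cluster cache.
func lookup(ctx context.Context, key string, hotKeys map[string]bool, l1, cluster Cache) ([]byte, error) {
  if hotKeys[key] {
    if v, err := l1.Get(ctx, key); err == nil && v != nil {
      return v, nil
    }
    // L1 miss: fall back to the cluster cache rather than failing.
  }
  return cluster.Get(ctx, key)
}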

For hotspots whose frequency is high but not extreme, the hotspot cache service only periodically notifies the query service to refresh the corresponding key, or performs TTL refresh and renewal operations.

When the data becomes less popular, the access counts in its time blocks drop. At that point, the L1 hotspot push or TTL renewal stops, and the data expires shortly afterwards.

With this feature added, the cache middleware can be renamed a data caching platform. However, it still differs from a real platform in some ways: it can only cache entity data and cannot flexibly process and push data, so some business-specific code still has to be written by hand.

Relationship Data Caching #

As you can see, our current caching is limited to entity data and does not yet support caching relational data.

To address this, we first improve the message-listening service by turning it into a Kafka Group Consumer service with dynamic scalability. This increases the system's parallel data-processing capability and lets it support a larger volume of concurrent modifications, as sketched below.
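This consumer sketch uses the segmentio/kafka-go client; the broker address, topic, and group names are hypothetical. Scaling out then simply means running more instances under the same GroupID, because Kafka rebalances partitions across the group:

package main

import (
  "context"
  "log"

  "github.com/segmentio/kafka-go"
)

func main() {
  // Every instance joins the same consumer group; Kafka rebalances
  // partitions across instances, so adding instances scales consumption.
  r := kafka.NewReader(kafka.ReaderConfig{
    Brokers: []string{"kafka:9092"},
    GroupID: "cache-sync",     // hypothetical consumer group
    Topic:   "binlog-changes", // hypothetical topic carrying change events
  })
  defer r.Close()

  for {
    msg, err := r.ReadMessage(context.Background())
    if err != nil {
      log.Printf("read: %v", err)
      return
    }
    // Hand the change event to the cache-refresh logic sketched earlier.
    log.Printf("change on key %s: %s", msg.Key, msg.Value)
  }
}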

In addition, for data caching systems with larger volumes, we can introduce multiple data engines to provide different data-support services, such as:

  • Lua script engine (for a review, please refer to Lesson 17) serves as the “engine” for data push, helping us dynamically synchronize data to multiple data sources.
  • Elasticsearch is responsible for providing full-text search functionality.
  • Pika is responsible for providing high-capacity KV query functionality.
  • ClickHouse is responsible for providing real-time query data aggregation and statistics functionality.
  • MySQL engine is responsible for supporting data queries for new dimensions.

Did you notice that we have mentioned all of these engines in previous lessons? The only one that might be unfamiliar is Pika, but it is not complicated: it can be understood as a Redis-compatible, high-capacity storage service built on top of RocksDB.

I haven’t gone into detail about each engine here, but have summarized their respective strengths. If you are interested in delving deeper, you can explore on your own and see which engines are suitable for different business scenarios.

Multi-Data Engine Platform #

An ideal multi-data engine platform is quite extensive and requires a lot of manpower to build. It can provide us with powerful data querying and analysis capabilities, and it is easy to integrate, greatly improving our business development efficiency.

To give you a holistic understanding, I have specially drawn an architecture diagram of the multi-data engine platform to help you understand the relationship between the data engine, cache, and data updates. The diagram is shown below:

As you can see, the basic data service has been developed into a platform. When MySQL data is updated, it will be filtered and pushed into different engines through the subscribed change messages, providing services such as data statistics, big data KV, in-memory caching, full-text search, and heterogeneous MySQL data querying.

When specific businesses need to access core business basic data, they need to apply for data access authorization from this platform. If there are special requirements, Lua scripts for data processing can be submitted to the platform. High-traffic businesses can even apply for an independent deployment of a data support platform.

Summary #

In this lesson, we learned about the implementation solution for a unified data caching platform. With this middleware, the development efficiency will be greatly improved. Before using the data support component, businesses had to implement their own cache and multi-data source synchronization, which required a lot of duplicated logic for cache refreshing, as shown in the following figure:

After using the data caching platform, we saved a lot of manually implemented work. Developers only need to configure it in the platform to enjoy the powerful multi-level cache function provided by the middleware and the data query service provided by various data engines, as shown in the following figure:

Let’s review the working principle of this middleware. First, we subscribe to the binlog of the MySQL database using Canal to obtain the data change messages. Then, the caching platform triggers cache updates based on the subscribed change information. Additionally, by combining the client SDK and cache query service, we can identify hot data and provide multi-level cache services.

It can be said that data is the heart of our system, and with a powerful data engine, we can do more. The biggest feature of the data support platform is its integration of our data with various data engines, thereby achieving more powerful data service capabilities.

Core systems of large companies usually use a combination of multiple engines to provide data support services. In some cases, the server side of these services only needs to be configured to obtain these functions. This makes business implementation more lightweight and creates broader value-added opportunities for the business.

Thought Questions #

When using a Bloom Filter to reduce L1 cache queries, how should the Bloom Filter's hash bitmaps be kept up to date on the client?

Feel free to leave a message in the comments section to discuss and exchange ideas. See you in the next class!