15 Selection: Different Stages of Data Storage and How to Store It #

Hello, this is Jingyuan.

Today I want to share with you a topic related to “storage” design on the Serverless platform. Typically, when designing a service or platform, storage is an essential component, such as user data, log information, intermediate states, etc.

When building a production-grade FaaS-based Serverless platform, you may encounter the following questions:

  • Where should function definitions be stored?
  • How should large code packages be stored?
  • How can we ensure real-time awareness of scaling resources’ states?

In this lesson, I will talk to you about how data storage is designed in Serverless. I will address the above questions and explore the two aspects of control and data to help you understand the considerations for designing storage solutions for a function computing platform.

Through this lesson, I hope you will gain a deeper understanding of the data types for function computing storage and the appropriate storage solutions for different types of data. This will enable you to design storage solutions for Serverless platforms smoothly.

Overall thinking #

So, how should we design and select a function computing platform based on the characteristics of function computing?

I have summarized the key points and ideas for designing a function computing platform in the mind map below.

[Image: mind map of the key points for designing a function computing platform]

In the control plane, there are two dimensions: metadata and the code package. On the data plane, following the execution path of a request, we mainly need to consider the caching of metadata during retrieval, the scheduling of function Pod resources, the collaboration between services, and the collection of logs. Of these, logs and service collaboration run through the entire system; service collaboration, for instance, includes service registration and discovery, message notification and coordination, leader election, and so on.

Storage on the Control Plane #

Let’s start with the control plane and see what data needs to be prepared when creating a function.

Metadata #

Before creating a function, we first need a namespace. Although most function computing products provide a default namespace so you can skip this step, in real business on the cloud, namespaces are used to manage different functions and separate lines of business, and they are also part of the metadata of function computing.

Next, we create the function itself. Function computing has quite a few basic properties. Beyond the common properties you see on the console, such as the function name, entry point, timeout, memory size, and runtime type, in practice the storage layer also needs to record the creation time, modification time, and unique identifier of the function. If gray (canary) releases are used, we also need to track version information. And if the user has bound a trigger, we need to record its basic information as well, such as the event source service and the trigger rules.

In addition to the auxiliary properties of the function itself, the platform usually needs to consider user-related information storage, such as user ID, permissions, and concurrency limits.

As you can see, there is quite a lot of metadata around a function, involving multiple entities such as functions, triggers, users, versions, aliases, and namespaces, each with numerous attributes.

These entities have complex relationships, and as you can probably guess, a relational database is the natural fit here: it flexibly supports complex queries and field updates. More importantly, relational databases are familiar territory; almost everyone first encounters them in school.
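To make this concrete, the entities above might map to relational tables roughly as follows. This is a minimal sketch using an in-memory SQLite database to stand in for MySQL or PostgreSQL; every table and column name here is hypothetical, not a prescribed schema:

```python
import sqlite3

# In-memory SQLite stands in for MySQL/PostgreSQL; the schema is illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE namespace (
    id      INTEGER PRIMARY KEY,
    name    TEXT UNIQUE NOT NULL,
    user_id TEXT NOT NULL
);
CREATE TABLE function (
    id           INTEGER PRIMARY KEY,
    uid          TEXT UNIQUE NOT NULL,            -- unique identifier of the function
    namespace_id INTEGER NOT NULL REFERENCES namespace(id),
    name         TEXT NOT NULL,
    handler      TEXT NOT NULL,                   -- entry point
    runtime      TEXT NOT NULL,
    timeout_s    INTEGER NOT NULL,
    memory_mb    INTEGER NOT NULL,
    code_key     TEXT,                            -- key of the code package in object storage
    created_at   TEXT NOT NULL,
    updated_at   TEXT NOT NULL
);
CREATE TABLE trigger_cfg (
    id          INTEGER PRIMARY KEY,
    function_id INTEGER NOT NULL REFERENCES function(id),
    source      TEXT NOT NULL,                    -- event source service
    rule        TEXT NOT NULL                     -- trigger rule
);
""")
conn.execute("INSERT INTO namespace (name, user_id) VALUES ('default', 'u-001')")
conn.execute("""INSERT INTO function
    (uid, namespace_id, name, handler, runtime, timeout_s, memory_mb, created_at, updated_at)
    VALUES ('fn-abc', 1, 'hello', 'index.handler', 'python3.10', 3, 128,
            '2024-01-01', '2024-01-01')""")
# Complex relationships are what the relational model handles well, e.g. a join:
row = conn.execute("""SELECT f.name, n.name FROM function f
                      JOIN namespace n ON f.namespace_id = n.id""").fetchone()
print(row)  # ('hello', 'default')
```

A real platform would add tables for users, versions, and aliases in the same style, plus indexes on the lookup columns.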

But my guess is that when it comes to database selection, most people simply go with the one they know best. So let me share a few common choices.

Here is the latest ranking from DB-Engines; the top four are all relational databases.

[Image: DB-Engines database popularity ranking]

Oracle has led the ranking for many years, but it is expensive and closed source. The cloud databases offered by the providers we are familiar with are largely built on MySQL, SQL Server, and PostgreSQL, so pay special attention to these options.

Among them, although newer versions of SQL Server can now also run on systems such as Linux thanks to cloud-native developments, judging from the history of the internet industry and the databases adopted by major companies' systems, I recommend MySQL or PostgreSQL.

MySQL is by now very mature, with an active community and excellent availability and performance. PostgreSQL bills itself as the world's most advanced open-source relational database; it offers nearly everything MySQL does and is very popular overseas, though in China it is still less widely used than MySQL.

Regarding their comparisons and usage recommendations, you can refer to the blog post “PostgreSQL vs. MySQL: What you need to know” which points out:

Consider PostgreSQL for any application that might grow to enterprise scope, with complex queries and frequent write operations. If you’re new to the world of databases and don’t expect your application to scale up, or you’re looking for a quick tool for prototyping, then consider MySQL.

Simply put: if you are new to databases, or your application is not expected to grow large, MySQL is the better fit; otherwise, PostgreSQL is recommended. I would add that if you use a cloud provider's RDS, pay close attention to specifications and prices when choosing, because the one that fits your needs is the best one.

Finally, on metadata storage: some serverless function computing platforms, taking a function-centric view of storage, instead choose document databases such as MongoDB. However, from the perspective of CRUD and entity management, I would still recommend a relational database for metadata.

Code Package #

Next, let's look at the storage design for code packages. They generally come in two forms: functions written against the cloud provider's code framework and packaged as ZIP, WAR, or JAR archives, and custom images.

  • Traditional compression package

For the first form, I believe that after this stretch of learning about cloud products, "object storage" immediately comes to mind.

Usually, when we talk about data storage, we generally think of three different ways: files, blocks, and objects. I have listed their differences here:

[Image: comparison of file, block, and object storage]

Considering the characteristics of serverless computing, where resource instances are stateless, mounting a distributed file system for every instance is cumbersome. Block storage has the same problem, and additionally requires you to implement block-level operations yourself.

Object storage, on the other hand, is accessed over HTTP and needs no mounting. Its key-value nature also makes code package paths highly readable. And although the table notes that object storage cannot modify an object in place, function code packages are read-heavy and rarely written, so object storage is a very good fit for user code.

In addition, because object storage saves each object through key-value pairs, it becomes particularly important to design a key generation rule for each function code package.

For example, you can associate user information with code information:

userId/hash(codeFile)

This way, when uploading a code package you only need to store its key as an attribute of the function in the database, and store the package itself in the object storage service under that key. When the code changes, the package and its hash change too, so the database metadata must be updated in sync.

When a function instance starts, it downloads the package from object storage by that key through the object storage API.
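A minimal sketch of the key rule above, hashing the package bytes; SHA-256 and the user ID and package contents below are assumptions for illustration:

```python
import hashlib

def code_package_key(user_id: str, package_bytes: bytes) -> str:
    """Build the object-storage key userId/hash(codeFile).
    SHA-256 is an assumption here; any stable digest works."""
    digest = hashlib.sha256(package_bytes).hexdigest()
    return f"{user_id}/{digest}"

pkg_v1 = b"def handler(event): return 'v1'"
pkg_v2 = b"def handler(event): return 'v2'"

key_v1 = code_package_key("u-001", pkg_v1)
key_v2 = code_package_key("u-001", pkg_v2)

# A code change produces a new key, so the function's metadata row in the
# database must be updated in sync with the new upload.
print(key_v1 != key_v2)  # True
```

On upload, the platform would PUT the package under this key via the object storage API and record the key on the function's metadata; on cold start, instances GET the package by the same key.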

  • Custom Image

When we hear "image," we naturally think of an image registry. Indeed, when designing a platform or system, leaning on existing basic services as much as possible is key to building our own services faster and more solidly.

In fact, it is not only custom images for function code: the services of the function computing platform itself also exist as images and can be stored in image repositories.

So choosing a suitable container image service is well worth it. Also remember to make the repository private when creating it, for security. The container image services provided by cloud vendors not only host managed repositories but typically also offer advanced capabilities such as image security scanning and multi-region image acceleration.

Mainstream public cloud service providers in China all have related container image services. For example, Alibaba Cloud’s ACR, Tencent Cloud’s TCR, Huawei Cloud’s SWR, and Baidu Cloud’s CCR are currently quite mature. You can consider various factors such as price and usage region to make a choice.

Storage on the Data Plane #

Next, let’s take a look at the storage design and access of the data plane. In the sixth lesson, I talked to you about the scheduling process after the traffic comes in. Here, we can extract a few key points related to data retrieval.

  • Metadata Retrieval: Production-level function computing platforms usually separate the data control and scheduling capabilities of the control plane. Therefore, the traffic scheduling service needs to obtain function information through a microservice (such as APIServer). To improve concurrency and performance, we can consider using caching to store infrequently changed metadata.
  • Resource Occupation: We know that a cold start is really the process of claiming a Cold Pod. Since production-grade resource control modules are usually deployed in a distributed manner, if two of them try to claim the same Cold Pod for different requests, which one gets it? How do we make sure it is not acquired twice? This is a locking problem.
  • Concurrency Sorting: Function instance Pods can support multiple concurrent operations. In this case, how do we balance the scheduling? For example, how do we know which Pod has more concurrent calls, and which one is currently idle? This involves the problem of sorting and scheduling.

In addition, in asynchronous scenarios, we also need to consider issues such as duplication and loss. We will find that all these problems can actually be solved through distributed caching middleware, such as Redis. It can solve the performance and concurrency issues mentioned above, as well as resource locking issues, sorting, and deduplication.

Let's look at a Pod sorting problem specific to the function scenario. You can use Redis's ZSET data structure: the function's unique identifier as the key, the associated warm Pod as a member, and that warm Pod's current concurrency as the score.

Key: Function's Unique Identifier
Member: Pod's IP
Score: Concurrency

This way we tie functions, resources, and requests together. When multiple resource control modules contend for the same resource, whichever one successfully claims the Cold Pod writes it into the corresponding ZSET first, so the others cannot occupy it. And because the score records the Pod's current concurrency, we can also tell from it how loaded or idle each Pod is.

All of this can be handled by caching middleware such as Redis, and the cached data can also be preloaded: when the user operates in the console, we can push the data into Redis ahead of time.
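To illustrate the claim-and-sort logic, here is an in-memory sketch that mimics the relevant Redis commands. In production these maps would be real ZSETs manipulated with ZADD NX, ZINCRBY, and ZRANGEBYSCORE; the function key and Pod IPs below are made up:

```python
# In-memory stand-in for the Redis ZSET: {function_key: {pod_ip: concurrency}}.
zsets: dict[str, dict[str, int]] = {}

def claim_cold_pod(func_key: str, pod_ip: str) -> bool:
    """Like ZADD key NX pod 0: only the first resource controller to add
    the Pod wins; later attempts see it already present and back off."""
    members = zsets.setdefault(func_key, {})
    if pod_ip in members:
        return False
    members[pod_ip] = 0
    return True

def incr_concurrency(func_key: str, pod_ip: str, delta: int) -> int:
    """Like ZINCRBY: adjust a warm Pod's in-flight request count (the score)."""
    zsets[func_key][pod_ip] += delta
    return zsets[func_key][pod_ip]

def idlest_pod(func_key: str) -> str:
    """Like ZRANGEBYSCORE ... LIMIT 0 1: pick the Pod with the lowest concurrency."""
    return min(zsets[func_key], key=zsets[func_key].get)

assert claim_cold_pod("fn-abc", "10.0.0.1") is True   # first controller wins
assert claim_cold_pod("fn-abc", "10.0.0.1") is False  # second controller backs off
claim_cold_pod("fn-abc", "10.0.0.2")
incr_concurrency("fn-abc", "10.0.0.1", 3)             # three requests in flight
print(idlest_pod("fn-abc"))  # 10.0.0.2
```

The key property this relies on is that the real ZADD with NX is atomic on the Redis server, so two distributed controllers cannot both claim the same Pod.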

Next, let’s take a look at the data that runs through the entire system: logs and service status.

Logs #

Logs should be familiar to you. In function computing, how runtime logs are designed and stored matters a great deal: it affects troubleshooting across the system, tracing and observability, and the ability to produce reports and statistics.

We already covered observability in detail in the previous section. Here, let's talk about using the logging platform to produce report data, such as the number of calls the platform served in a day, or how many resources a given function consumed in a day.

Although reports originate from the collection and aggregation of logs, there are still obvious differences from the storage of logs:

  • Report data only needs a few key metrics, such as call counts, resource usage, and execution time; we don't care about the auxiliary detail in the logs.
  • Report data must be kept much longer. Execution logs may only need to be retained for one or two weeks or a month and are usually cleaned up periodically, whereas reports are usually tied to revenue and are typically kept for years.

For report storage, I recommend Apache Doris. It started out as a dedicated system for the statistics reports of Baidu's Phoenix advertising platform and, after many iterations, has grown into an MPP-based cloud data warehouse contributed to the Apache Foundation. In terms of technical maturity and business hardening, Doris is definitely your first choice.

Doris speaks the MySQL protocol and uses standard SQL; that alone makes it quick to get started with. Its data import options and surrounding ecosystem are also rich: Logstash, which we use for log cleansing, can serve as an input to Doris, so our report pipeline can filter out the key information with Logstash and save it to Doris.
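As a tiny illustration of what the report pipeline keeps versus drops, here is a pure-Python stand-in for the filter-and-aggregate step; the log field names and the GB-seconds formula are illustrative assumptions, not a real Logstash or Doris configuration:

```python
from collections import defaultdict

# Hypothetical execution-log records; field names are illustrative.
logs = [
    {"func": "fn-abc", "duration_ms": 12, "mem_mb": 128, "status": "ok"},
    {"func": "fn-abc", "duration_ms": 30, "mem_mb": 128, "status": "ok"},
    {"func": "fn-xyz", "duration_ms": 8,  "mem_mb": 256, "status": "error"},
]

# Keep only the key report metrics: call count, resource usage, errors.
# Auxiliary log detail (stack traces, request IDs, node info) is dropped.
report = defaultdict(lambda: {"calls": 0, "gb_seconds": 0.0, "errors": 0})
for rec in logs:
    row = report[rec["func"]]
    row["calls"] += 1
    row["gb_seconds"] += rec["mem_mb"] / 1024 * rec["duration_ms"] / 1000
    row["errors"] += rec["status"] != "ok"

print(report["fn-abc"]["calls"])  # 2
```

In the real pipeline the aggregated rows would be written into a Doris table and retained for years, while the raw logs behind them are cleaned up on a short cycle.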

Of course, depending on your team's history or your business's infrastructure, you can also choose ClickHouse, Greenplum, or a self-developed OLAP engine. And given how complex the underlying storage systems are to operate, handing a suitable platform to an operations team, or simply using a cloud vendor's database product, is also a good solution.

Service Status #

In addition, the services involved in function execution need highly available deployments, which means service state must be recorded as well.

For example, offline scaling modules are often deployed in a primary-backup fashion, which raises the question of leader election. Some services also need to scale horizontally and therefore rely on service registration and discovery.

The most common solution is a registry that can synchronize service state, handling leader election, notification, coordination, and message publishing for the different service modules.

In today's cloud-native era, I recommend Etcd. It is designed from the ground up for distributed coordination (event watching, leases, elections, distributed locks) and positions itself as infrastructure for cloud computing. Many higher-level systems, such as Kubernetes, Cloud Foundry, and Mesos, build on Etcd, and large internet companies such as Google and Alibaba use it widely, making it an important choice for medium- and large-scale cluster management systems.
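Leader election in Etcd rests on writing a key under a TTL lease with a compare-and-set transaction: the leader must keep renewing its lease, and if it stops, the lease expires and another candidate wins. The sketch below simulates that pattern in memory; it does not use the real Etcd client API, and the module names are invented:

```python
class LeaseElection:
    """In-memory sketch of lease-based leader election in the Etcd style:
    a candidate atomically writes the leader key with a TTL lease; if the
    lease expires without renewal, the leadership becomes claimable again."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.leader = None        # current holder of the leader key
        self.expires_at = 0.0     # when the holder's lease runs out

    def campaign(self, candidate: str, now: float) -> bool:
        # Equivalent to a transaction: succeed only if the key is absent
        # or its lease has expired; otherwise only the holder may renew.
        if self.leader is None or now >= self.expires_at:
            self.leader = candidate
            self.expires_at = now + self.ttl
            return True
        if self.leader == candidate:
            self.expires_at = now + self.ttl  # keep-alive renewal
            return True
        return False

elect = LeaseElection(ttl=5.0)
assert elect.campaign("scaler-a", now=0.0) is True   # primary wins leadership
assert elect.campaign("scaler-b", now=1.0) is False  # backup must wait
assert elect.campaign("scaler-b", now=6.0) is True   # primary's lease expired
```

The real thing adds what the simulation cannot: the compare-and-set and lease expiry happen on a replicated, consensus-backed store, so the decision survives node failures.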

Etcd's official website provides a detailed comparison with ZooKeeper, Consul, and NewSQL across 10 dimensions, including concurrency primitives, linearizable reads, and multi-version concurrency control.

[Image: Etcd's feature comparison with ZooKeeper, Consul, and NewSQL]

In short, Etcd and ZooKeeper solve the same problem: coordination of distributed systems and storage of metadata. Their differences come down to design philosophy. Etcd is "younger and lighter" than ZooKeeper and adds improvements such as leases and MVCC on top of ZooKeeper's ideas. For its stability and newer features, I recommend Etcd for new applications.

The advantage of Consul lies in service discovery. It provides built-in health checks, failure detection, and DNS services. It has a different emphasis from Etcd in terms of the problems it solves. If it is for consistent key-value storage in distributed systems, I recommend using Etcd. If it is end-to-end service discovery, then Consul is better.

NewSQL is more suitable for storing GB-level data or scenarios that require complete SQL query capabilities.

Summary #

Finally, let me summarize our content today. In this lesson, I introduced the storage design ideas and resource selection of FaaS-based Serverless platforms from the perspectives of control plane and data plane.

We learned about different data entities and some methods and considerations for selection during runtime. We need to make a reasonable choice from the perspectives of data characteristics, enterprise infrastructure, developers’ familiarity cost, business scale, and so on, rather than solely evaluating based on technical indicators of storage resources.

As I mentioned in the discussion on logging storage, if the cost of self-operating Doris is relatively high, can we “borrow strength” and choose the database system that the team is already using? Whether it is building a platform or a business, ROI is also a factor we need to consider.

As for the relational databases we discussed for the control plane: could storing functions under keys be another option? Yes, a KV-style store is also viable. It depends on how the architect weighs implementation complexity against convenience, and that trade-off shifts with the business and the system's stage of development.

Therefore, MySQL, PostgreSQL, Redis, Doris, Etcd, and other systems mentioned today are just concrete expressions for our consideration. What we should be more concerned about is the mindset of building a system.

Homework #

Alright, this class has come to an end, and I have a small homework assignment for you.

Based on the knowledge you gained from this class, think about what other data needs to be stored if you were to design a Serverless platform and how you would store it.

Feel free to write down your thoughts and answers in the comments section. Let’s have a discussion and exchange ideas together. Thank you for reading, and feel free to share this class with more friends for learning and discussion.

Further Reading #

(Etcd, Concurrency Primitives): Link 1, Link 2, Link 3, Link 4, Link 5

(ZooKeeper, Concurrency Primitives): Link 6

(Consul, Concurrency Primitives): Link 7

(NewSQL, Concurrency Primitives): Link 8

(Etcd, Linearizable Reads): Link 9

(Consul, Linearizable Reads): Link 10

(Etcd, Multi-version Concurrency Control): Link 11

(Etcd, Transactions): Link 12

(ZooKeeper, Transactions): Link 13

(Consul, Transactions): Link 14

(Etcd, Change Notification): Link 15

(ZooKeeper, Change Notification): Link 16

(Consul, Change Notification): Link 17

(Etcd, User Permissions): Link 18

(ZooKeeper, User Permissions): Link 19

(Consul, User Permissions): Link 20

(NewSQL, User Permissions): Link 21, Link 22

(Etcd, HTTP/JSON API): Link 23

(Consul, HTTP/JSON API): Link 24

(Etcd, Membership Reconfiguration): Link 25

(ZooKeeper, Membership Reconfiguration): Link 26

(Consul, Membership Reconfiguration): Link 27