
21 System Architecture - Should a System That Handles 10,000 Requests per Second Be Service-Oriented and Decoupled? #

Hello, I’m Tang Yang.

Based on the content of the previous chapters, you have already optimized your vertical e-commerce system in terms of performance, availability, and scalability from the perspectives of databases, caching, and message queues.

Now, your system is running smoothly and receiving positive feedback. During peak hours it already handles 10,000 requests per second, and daily active users (DAU) have reached several hundred thousand. The CEO is very pleased and plans to continue improving product features for the next round of operations and promotions, aiming for a DAU of over one million during the next Double 11 (Singles’ Day) event. At this point, you start considering how to evolve your system to support that higher level of concurrent traffic.

Therefore, you review your system architecture and analyze areas for optimization.


Currently, your system is deployed as a monolith: all of the e-commerce system’s functional modules, such as orders, users, payments, and logistics, are bundled into one large web application and deployed on the application servers.

You have a vague feeling that this deployment approach may have some problems, so you Google it and discover that when a system reaches a certain stage of development, it should be split into microservices. You also come across Alibaba’s “Five Color Stone” project, which greatly improved the scalability of Alibaba’s overall architecture. All of this fascinates you.

However, one question keeps nagging at you: what exactly motivates splitting a monolithic architecture into microservices? Does it mean that once overall QPS reaches 10,000 or 20,000, the system must be split into microservices?

Pain Points of Monolithic Architecture #

Let’s first recall why you chose to use monolithic architecture in the first place.

When the e-commerce project was just starting, you simply wanted to build the project quickly, launch the product early, and validate the idea fast.

In the early stages of system development, this architecture did bring great convenience to your development and operation, mainly reflected in the following aspects:

  • Development is simple and direct, with centralized code and project management.
  • Only one project needs to be maintained, which saves the manpower cost of keeping the system running.
  • When troubleshooting, you only need to look at the one application process, so the target is clear.

However, as the functionality became more complex and the development team grew larger, you gradually felt some drawbacks of the monolithic architecture, mainly in the following aspects.

First, at the technical level, the number of database connections could become a bottleneck for the system.

As mentioned in Lecture 7, database connections are relatively heavyweight resources: establishing a connection is time-consuming, and the number of clients that can connect to MySQL is capped, with a maximum setting of 16,384 (in real projects it is tuned to the business’s needs).

That number seems large, but because your system is deployed as a monolith with no layering in the deployment structure, the application servers connect to the database directly. So when frontend request volume grows and application servers have to be scaled out, the number of database connections grows with them. Let me give you an example.

In a system I previously maintained, the database’s maximum number of connections was set to 8,000. The application servers ran on roughly 50 virtual machines, each holding 30 connections to the database, yet the actual number of database connections far exceeded 30 × 50 = 1,500.

That is because, besides the application servers handling external client traffic, we also deployed separate application services for internal calls from other departments, as well as queue-processing machines that consume messages from the message queues. All of these connected directly to the database, and added together, the connection count came close to 3,400 during peak periods.
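To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. Only the 50 × 30 web-server figure comes from the example above; the other server counts and pool sizes are hypothetical, chosen just so the total lands near the 3,400 peak.

```python
# Back-of-the-envelope connection budget for a monolith in which every kind of
# server connects straight to the database. Only the web-server figures come
# from the example above; the other groups are hypothetical.
MAX_CONNECTIONS = 8000  # the database's configured connection limit

server_groups = {
    # group name: (number of servers, connections per server)
    "web servers (external client traffic)": (50, 30),     # 1,500 connections
    "internal API servers (other departments)": (40, 30),  # hypothetical
    "queue-processing machines": (25, 25),                  # hypothetical
}

total = sum(count * per_server for count, per_server in server_groups.values())
print(f"estimated peak connections: {total} / {MAX_CONNECTIONS}")

# Scaling out web servers for a promotion adds connections linearly:
extra_web_servers = 30
print(f"after adding {extra_web_servers} web servers:",
      total + extra_web_servers * 30, "/", MAX_CONNECTIONS)
```

The point is that every new group of machines, and every scale-out of an existing group, eats into the same fixed connection budget.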

So whenever a large marketing or promotion campaign required scaling out the servers, the number of database connections rose accordingly and hovered right at the edge of the maximum connection limit, like a time bomb that could affect service stability at any moment.

The second point is that a monolithic architecture increases development costs and holds back development efficiency.

“The Mythical Man-Month” points out that a team’s internal communication cost is tied to headcount: a team of n people has roughly n(n-1)/2 communication channels, so communication cost grows quadratically as the team grows. For example, a team of 100 people has roughly 100 × 99 / 2 = 4,950 communication channels. To reduce communication costs, we generally divide the team into several small teams of 5-7 members, each responsible for developing and maintaining a specific functional module.
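As a quick check on that formula, the tiny sketch below tabulates n(n-1)/2 for a few team sizes; the takeaway is that the channel count grows quadratically, which is exactly why we cap each small team at 5-7 people.

```python
# Pairwise communication channels in a team of n people: C(n, 2) = n * (n - 1) / 2
def channels(n: int) -> int:
    return n * (n - 1) // 2

for n in (5, 7, 20, 50, 100):
    print(f"{n:>3} people -> {channels(n):>4} channels")
# 5 -> 10, 7 -> 21, 20 -> 190, 50 -> 1225, 100 -> 4950
```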

For example, your vertical e-commerce team may be divided into a user group, an order group, a payment group, a product group, and so on. When that many small teams maintain the same codebase and the same system, coordination problems arise.

Communication between teams is limited. If one team needs a feature for sending text messages, some developers will feel the fastest route is to build it themselves rather than ask whether another team already has one. That mindset is misguided and leads to the same functionality being developed more than once.

Because the code is deployed together and everyone commits to the same repository, code conflicts are unavoidable. At the same time, functionalities are tightly coupled: a small logic change on your part may break other features, forcing regression testing of the whole feature set and stretching out delivery times.

Modules depend on each other, so if a member of a small team makes a mistake, it may affect services maintained by other teams, greatly impacting the overall system stability.

The third point is that monolithic architecture also has a significant impact on system operations and maintenance.

Imagine that in the early stages of the project, your code may have only a few thousand lines, and building it once only takes a minute. This allows you to frequently and agilely deploy changes and fix problems. However, when your system grows to hundreds of thousands, or even millions, of lines of code, the time required for a single build process, including compilation, unit testing, packaging, and uploading to the production environment, can reach more than ten minutes. Additionally, any small modification requires building the entire project, and the process of deploying changes becomes very inflexible.

All of these problems can be solved by breaking down the system into microservices.

How to Use Microservices to Solve These Pain Points #

Previously, when I was working on a community business, I initially adopted a monolithic architecture. The database was already vertically sharded, with separate databases for users, content, and interactions. The project was also divided into separate business pools, namely the user pool, content pool, and interaction pool.

As frontend traffic grew, we found that in every business pool the user module received the most requests, and the user database carried the highest request volume as well. That is easy to understand: both content and interactions have to query the user database for user data. So even though we had divided the business pools, every pool still needed its own high-volume connections to the user database, which made the user database prone to becoming a bottleneck.


So how do we solve this problem?

In fact, we can deploy the logic related to users as a separate service. Whether it is the user pool, content pool, or interaction pool, they can all connect to this service to obtain and modify user information. In other words, only this service can connect to the user database, and other business pools do not directly connect to the user database to obtain data.

Since this service only handles user-related logic, it does not need many instances to carry the traffic, which effectively caps the number of connections to the user database and improves the system’s scalability. Following the same approach, we can also peel off the content and interaction logic into content and interaction services. In this way, splitting the system horizontally by business function solves the scalability problem at the database level.
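Below is a minimal sketch of what that split can look like, assuming an HTTP user service built with Flask and callers using requests; the endpoint path, table schema, and service hostname are illustrative assumptions, not the actual system’s API.

```python
# Minimal sketch of the split: only the user service connects to the user
# database; other pools call it over HTTP. The two sections below would run
# as separate processes in practice.

# --- user service: the only process holding user-database connections ---
import sqlite3                    # stand-in for the real user database driver
from flask import Flask, jsonify  # assumes Flask is installed

app = Flask(__name__)
user_db = sqlite3.connect("users.db", check_same_thread=False)

@app.route("/users/<int:user_id>")
def get_user(user_id: int):
    row = user_db.execute(
        "SELECT id, nickname FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return jsonify({"error": "not found"}), 404
    return jsonify({"id": row[0], "nickname": row[1]})

# --- content pool: no user-database connection at all ---
import requests  # assumes requests is installed

def render_post(post: dict) -> dict:
    # Fetch author info from the user service instead of the user database.
    author = requests.get(f"http://user-service/users/{post['author_id']}").json()
    return {"title": post["title"], "author": author["nickname"]}
```

The key property is that the connection to the user database lives only inside the user service process, so scaling the content or interaction pools no longer adds user-database connections.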


Another example from the same community business: multiple modules needed a geolocation capability that converts an IP address or geographical coordinates into city information. For instance, content recommendation can surface nearby content based on the user’s city, and content display needs to show the city as well.

Implementing this logic in every module would duplicate code. Instead, we can wrap the conversion of IPs or coordinates into city information as a separate service for other modules to call. In other words, common capabilities unrelated to the business can be extracted and pushed down into standalone services.
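As a sketch of what such a shared, business-agnostic service looks like from the caller’s side, here is a hypothetical client wrapper; the service URL, endpoints, and response format are assumptions for illustration.

```python
# Hypothetical client for a shared location service that every module reuses
# instead of re-implementing IP/coordinate-to-city conversion itself.
import requests

LOCATION_SERVICE = "http://location-service"  # assumed internal hostname

def city_of_ip(ip: str) -> str:
    """Resolve an IP address to a city name via the shared location service."""
    resp = requests.get(f"{LOCATION_SERVICE}/city", params={"ip": ip}, timeout=1)
    return resp.json()["city"]

def city_of_coords(lat: float, lng: float) -> str:
    """Resolve latitude/longitude to a city name via the same service."""
    resp = requests.get(
        f"{LOCATION_SERVICE}/city", params={"lat": lat, "lng": lng}, timeout=1
    )
    return resp.json()["city"]

# The recommendation and display modules both call these helpers rather than
# each maintaining their own conversion logic.
```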

By splitting the system along these two lines, each service has cohesive functionality and clearly assigned maintainers, and adding new functionality only requires testing that one service. If a service has problems, circuit breaking and degradation can limit its impact on other services (I will cover this systematically in Lesson 34).

In addition, since each service is only a subset of the original system, the codebase is much smaller compared to the original system, leading to faster build speeds.

Of course, once microservices are in place, the original monolith is split into multiple sub-services, which introduces new problems in both development and operations. What are those problems, and how do we solve them? I will walk you through them in the next lesson.

Summary of the Course #

In this lesson, I mainly walked you through the considerations that drive a move to microservices in real business scenarios. The system’s QPS (queries per second) is not, by itself, the deciding factor. I have summarized the influencing factors as follows:

  1. Scalability problems with the resources the system depends on, especially database connection bottlenecks.
  2. A large team maintaining a single codebase, which lowers development efficiency and raises development costs.
  3. Rising deployment costs.

From this, you should have some realizations: In the early and middle stages of architecture evolution, performance, availability, and scalability are the main goals we pursue. High performance and availability provide users with a better experience, and scalability allows us to handle larger levels of concurrency. However, as the system grows larger and the team size increases, we have to consider cost.

The term “cost” here carries a lot of meaning. It covers not only the cost of purchasing servers but also the development team’s internal development costs, communication costs, operations and maintenance costs, and so on. Sometimes cost becomes the decisive factor in architecture design.

For example, if you are building a live-streaming system, then in addition to streaming speed you also need to consider CDN costs. Likewise, if you are a team leader, then beyond pushing normal feature development forward, you also need to invest in the toolchain to raise engineers’ development efficiency and lower development costs.

This is easy to understand: if your tooling saves each person 10 minutes a day in a team of 100 people, that adds up to nearly 17 hours a day, roughly the equivalent of adding two full-time engineers. It is out of these considerations of improving scalability and reducing cost that we ultimately set out on the path of microservices.