
23 | Gateway Programming: How to Reduce Development Costs Through User Gateways and Caching #

Hello, I am Xu Changlong.

If we think of user traffic as waves in a turbulent sea, then the gateway is the seawall that absorbs their impact. In large Internet projects, a gateway is essential and is currently the best defensive facility we have. With a gateway we can divert large amounts of traffic to the appropriate services, and by using the scripting capabilities of engines such as Lua we can also greatly reduce system coupling and performance loss, saving costs.

Generally speaking, gateways can be divided into external (public-facing) gateways and internal gateways. The main responsibilities of an external gateway are rate limiting, intrusion prevention, and request forwarding; the common approach is to implement these with Nginx + Lua. On the internal network side, various specialized gateways have emerged in recent years, such as service meshes and sidecars, as well as products like Kong and Nginx Unit. Although their purposes differ, their main functions are still load balancing, traffic management and scheduling, and intrusion prevention.

So what crucial functional support does a gateway provide? Let’s analyze it in this lesson.

Functions of the External Network Gateway #

Let’s start with the external network gateway. I will share two practical external-gateway designs that can help us defend against intrusion and reduce business-side dependencies.

Spider Sniffing and Identification #

Websites with high traffic often face problems such as attacks, crawler scraping, and even intrusions. With a gateway, we can implement rate limiting and intrusion detection to block common attacks.

I want to share with you the two most common and most serious issues: hotlinking (illegal referencing of our resources) and bot crawling.

Hotlinking generally means other sites consuming our network resources through a large number of requests. To prevent it, we can check the Referer header of each request: if the referer is not one of our own domains, we reject the request. This reduces the risk of our resources being used illegitimately.
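
If the gateway is built on Nginx (or OpenResty), this Referer check maps directly onto the built-in referer module. Below is a minimal sketch; the `/static/` path and `example.com` are placeholder values for your own assets and domain:

```nginx
location /static/ {
    # Allow empty referers (direct visits), our own server names, and our subdomains.
    valid_referers none blocked server_names *.example.com;

    # $invalid_referer is set when the Referer header matches none of the above.
    if ($invalid_referer) {
        return 403;
    }

    root /data/www;  # placeholder path to the protected static resources
}
```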

The other issue is bot crawling, and identifying it takes a few tricks.

First, we need to narrow down the scope. There are usually two kinds of users: anonymous users and logged-in users. For anonymous users, we analyze request hotspots and IP addresses within a time window to identify high-frequency IPs. For logged-in users, we count their request volume and frequency over the same kind of window; if a threshold is exceeded, we reject the request and add the user to a suspicious list for further investigation.
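
As a rough illustration of the counting step for anonymous users, the sketch below uses an OpenResty shared dictionary to count requests per IP within a time window. It assumes `lua_shared_dict req_count 10m;` has been declared in the `http` block; the 60-second window and the 600-request threshold are made-up values:

```nginx
access_by_lua_block {
    local dict = ngx.shared.req_count
    -- Count requests per client IP within a 60-second window.
    local key = "ip:" .. ngx.var.remote_addr
    local count, err = dict:incr(key, 1, 0, 60)  -- init = 0; the key expires 60s after it is first set
    if not count then
        ngx.log(ngx.ERR, "request counting failed: ", err)
        return
    end
    if count > 600 then
        -- Over the threshold: reject and flag the IP for the suspicious list.
        ngx.log(ngx.WARN, "high-frequency IP: ", ngx.var.remote_addr)
        return ngx.exit(429)  -- Too Many Requests
    end
}
```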

To confirm the behavior of users on the suspicious list, here is a technique with a low probability of misjudgment.

When a suspicious user makes a request, the gateway dynamically injects JavaScript sniffing code into the responses served to that specific user or IP address. This code writes a special ciphertext into cookies and localStorage.

When our frontend JavaScript code detects this ciphertext, it enters anti-bot mode. Anti-bot mode can determine whether the client has mouse movement and click actions, thereby judging whether the user is a robot. After confirming that the user is legitimate, the frontend sends a request with a new signature to unlock the service on the server. If the client does not respond, the user is automatically considered a candidate for banning, and the request is blocked. If an IP address has a certain number of blocked requests, it will be banned.

However, this design has a drawback: it is not SEO-friendly, as bots from major search engines will be rejected. Our previous solution was to use a whitelist to allow bots from major search engines. Specifically, we whitelist the User-Agent of the bots and periodically manually verify the IP addresses of the search engine bots.

In addition, for some core and important interfaces, we can add a rule that “a request must carry a signature with a timestamp to be accepted, otherwise it will be rejected.” This rule can block a fair amount of bot crawling.
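
A sketch of such a check in OpenResty is shown below. The `X-Timestamp`/`X-Sign` header names, the signature layout (HMAC-SHA1 over the URI plus the timestamp), and the 5-minute validity window are all assumptions for illustration:

```nginx
access_by_lua_block {
    local secret  = "change-me"            -- shared secret, assumed to be configured elsewhere
    local headers = ngx.req.get_headers()
    local ts   = tonumber(headers["X-Timestamp"])
    local sign = headers["X-Sign"]

    -- Reject requests without a signature or with an expired timestamp.
    if not ts or not sign or math.abs(ngx.time() - ts) > 300 then
        return ngx.exit(ngx.HTTP_FORBIDDEN)
    end

    -- Recompute the signature and compare.
    local expected = ngx.encode_base64(ngx.hmac_sha1(secret, ngx.var.uri .. ts))
    if sign ~= expected then
        return ngx.exit(ngx.HTTP_FORBIDDEN)
    end
}
```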

Gateway Authentication and Decoupling from User Center #

I just shared with you some techniques to use the gateway to block illegal users. Besides defending against attacks and preventing malicious resource consumption, the gateway can also help us decouple some business dependencies.

Remember the user login design we discussed in Lesson 3? There, each business verifies the user’s identity itself rather than calling the user center on every request. This is usually implemented by integrating the user center’s SDK, which carries the unified verification logic.

However, this also raises a problem: dependency on the SDK and the need to keep its versions in sync. Basic shared components usually provide an SDK to make business development easier; if only an API were provided, some special operations would have to be reimplemented by every business. But once an SDK is released, we have to be prepared to maintain multiple versions of it at the same time.

The following diagram shows the effects of authentication by SDK token and user center interface mentioned in Lesson 3:

Authentication by SDK token and user center interface

As shown in the diagram, integrating the SDK lets each business verify the user’s identity independently without calling the user center. However, because multiple SDK versions are in the wild, future upgrades of the user center face great resistance: we have to take every business that uses the SDK into account.

An SDK is a component embedded in other teams’ projects, and for stability those projects rarely upgrade or modify their dependencies. This makes the user center hard to upgrade: every major upgrade of the basic service requires a large amount of manpower to roll the SDK out everywhere, which increases maintenance cost.

So, besides using the SDK, is there any other way to avoid this component coupling? Here I share an interesting design, which is to put the user login authentication function in the gateway.

I have used a diagram to describe the request process. You can refer to the diagram while I continue to analyze it.

Gateway authentication and decoupling from user center

Combining this with the diagram, let’s walk through the workflow of this implementation. When a user’s business request arrives, the gateway first verifies the user’s identity before forwarding the request to the business API.

If verification succeeds, the user’s information is passed to the downstream services through request headers. The business API does not need to care about how the user center is implemented; it simply reads the user information attached in the headers. If a business requires login, it only has to check whether the request headers contain a uid and, if not, return a unified error code reminding the user to log in first.
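
Here is a minimal sketch of what the gateway-side check might look like in OpenResty. The token layout (uid plus an HMAC signature carried in a `token` cookie) and the `X-UID` header name are assumptions for illustration; a real gateway would more likely verify a JWT or consult the user center’s token cache:

```nginx
access_by_lua_block {
    local secret = "change-me"
    local token  = ngx.var.cookie_token      -- token carried in a "token" cookie (assumed)

    -- Strip any user-supplied X-UID so downstream services can trust the header.
    ngx.req.clear_header("X-UID")

    if token then
        -- Assumed token layout: "<uid>.<base64 hmac_sha1(secret, uid)>"
        local uid, sig = token:match("^(%d+)%.(.+)$")
        if uid and sig == ngx.encode_base64(ngx.hmac_sha1(secret, uid)) then
            -- Verified: attach the user identity for the business API.
            ngx.req.set_header("X-UID", uid)
            return
        end
    end
    -- Not logged in or invalid token: the business API sees no X-UID header
    -- and returns its unified "please log in" error code.
}
```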

It can be seen that this authentication service design decouples the business and the user center. If there are logic changes in the user center, there is no need for the business to upgrade accordingly.

In addition to ordinary login authentication, we can enable a Role-Based Access Control (RBAC) service for some domains, customize RBAC or ABAC (Attribute-Based Access Control) rules for different businesses, and use the gateway to grant different permissions or run gray releases (canary tests) for different users.
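
As a purely illustrative sketch of how such a permission check could sit on top of the authentication step above, the snippet below maps the verified uid to a role and checks a path-prefix rule. The role and rule tables are hard-coded placeholders; in practice they would come from the RBAC/ABAC service or a cache the gateway refreshes, and the logic would be merged into the same access-phase handler as the authentication check:

```nginx
access_by_lua_block {
    local uid = ngx.req.get_headers()["X-UID"]
    -- Placeholder role lookup; a real gateway would query or cache the RBAC service.
    local roles = { ["10001"] = "admin", ["10002"] = "editor" }
    local role  = roles[uid] or "guest"

    -- Path-prefix permissions per role (illustrative values).
    local allowed = {
        admin  = { "/admin/", "/api/" },
        editor = { "/api/" },
        guest  = { "/api/public/" },
    }
    for _, prefix in ipairs(allowed[role]) do
        if ngx.var.uri:sub(1, #prefix) == prefix then
            return  -- permitted
        end
    end
    return ngx.exit(ngx.HTTP_FORBIDDEN)
}
```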

Internal Network Gateway Service #

After looking at these two clever uses on the external side, let’s turn to the internal network gateway. It can provide failure retry and a smooth restart mechanism; let’s look at them in turn.

Failure Retry #

During a release upgrade, or when a service crashes, the service is temporarily unavailable. If a user sends a request at that moment, the backend gives no response and a 504 error is returned, which makes for a poor user experience.

In this situation we can use the automatic retry function of the internal gateway. When a request is forwarded to the backend and the service returns a 500, 403, or 504 error, the gateway does not return the error immediately; instead it holds the request for a short while and retries it, or directly returns previously cached content. This enables smooth hot upgrades of the business, makes the service appear more stable, and keeps users from noticing online upgrades.
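
With plain Nginx as the internal gateway, this retry behavior can be approximated with `proxy_next_upstream`. The sketch below is illustrative; `backend_api` and the addresses are placeholders, and the exact status codes to retry on should follow your own policy:

```nginx
upstream backend_api {
    server 10.0.0.11:8080;  # placeholder instances
    server 10.0.0.12:8080;
}

server {
    listen 8000;

    location /api/ {
        proxy_pass http://backend_api;

        # On errors, timeouts, and 5xx responses, try the next instance instead of
        # returning the failure to the caller immediately (http_403 can be added
        # here as well if 403s should be retried, as described above).
        proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        proxy_next_upstream_tries 3;
        proxy_next_upstream_timeout 10s;
    }
}
```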

Smooth Restart #

Next, let me talk about the mechanism of smooth restart.

During a service upgrade, instead of letting the process exit as soon as it receives the kill signal, we implement a smooth restart: the process stops accepting new requests, waits for in-flight requests to finish, and exits directly only if the wait exceeds 10 seconds.

With this mechanism, the user request processing will not be interrupted, which ensures that the business transactions being processed are complete. Otherwise, it is likely to cause inconsistent business transactions or only complete half of them.
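
At the gateway layer itself, Nginx/OpenResty already behaves this way during a reload: old worker processes stop accepting new connections and exit once in-flight requests finish. The snippet below caps that drain time at 10 seconds, matching the wait described above; backend services written in other languages would implement the same pattern with their framework’s graceful-shutdown hooks:

```nginx
# In the main (top-level) configuration context:
worker_shutdown_timeout 10s;   # force old workers out if draining takes longer than 10s

# Triggered from the command line during a release:
#   nginx -s reload   # spawn new workers, gracefully drain the old ones
#   nginx -s quit     # graceful shutdown of the whole gateway
```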

With these retry and smooth-restart mechanisms we can upgrade, release code, and roll out new features online at any time. However, once this is enabled it may mask some online failures, so we should rely on the gateway’s monitoring to help us observe the system’s status.

Comprehensive Application of Internal and External Gateways #

Earlier, we discussed the functions provided independently by both the external and internal gateways. Now let’s take a look at their comprehensive application.

Service Interface Caching #

First, let’s consider the gateway interface caching function, which uses the gateway to cache the content of certain interfaces. This is suitable for use in service degradation scenarios to temporarily alleviate the impact of user traffic or reduce the impact on internal network traffic.

The specific implementation is shown in the following diagram:

Based on the diagram, we can see that the gateway caching is generally implemented using temporary cache + TTL (Time to Live). When a user requests the server, if the cached API has been requested before and the cache has not expired yet, the cached content will be directly returned to the client. This approach greatly reduces the data service pressure on the backend.

However, every technical choice is the result of repeated trade-offs. This approach sacrifices strong data consistency. It also places higher performance demands on the cache: the gateway cache must be able to absorb the full QPS (Queries Per Second) of the external traffic.

If you want to limit penetration traffic further, you can also use scripts to periodically refresh the cached data. The gateway returns the cached entry directly if one is found; if there is no match, it makes the actual request to the backend server and caches the result. This implementation is more flexible and offers better data consistency, but it requires extra effort to write and maintain the refresh code.

Of course, it is recommended that each cached entry stay under 5 KB (100,000 QPS × 5 KB ≈ 488 MB/s), because overly long payloads will slow down the cache service’s response.
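
Put together, a temporary-cache-plus-TTL setup on an Nginx gateway might look like the sketch below. The zone name, sizes, 30-second TTL, and the `/api/list` path are illustrative:

```nginx
# Shared cache zone for API responses (sizes are illustrative).
proxy_cache_path /var/cache/nginx/api levels=1:2 keys_zone=api_cache:100m
                 max_size=1g inactive=60s;

location /api/list {
    proxy_pass http://backend_api;        # placeholder upstream

    proxy_cache       api_cache;
    proxy_cache_key   $scheme$host$request_uri;
    proxy_cache_valid 200 30s;            # TTL: reuse cached 200 responses for 30 seconds

    # Only one request per key hits the backend while the entry is being filled,
    # which limits penetration traffic; stale entries can be served on errors.
    proxy_cache_lock      on;
    proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
}
```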

Service Monitoring #

Lastly, let’s talk about using the gateway for service monitoring. Let’s first consider this question: how would we typically perform monitoring without service tracing?

In fact, most systems use gateway logs for monitoring. We can judge whether the business is healthy by examining the HTTP status codes in the gateway access logs, and by combining them with the response times of different requests we get basic system monitoring.

To help you further understand, the following diagram shows how to monitor services using the gateway. You can refer to the image while I continue to explain.

Monitoring services with the gateway

To facilitate online situation assessment, we need to first collect statistics. The specific method is to periodically aggregate errors from the access logs, summarizing the number of errors for different API requests. The format is similar to “Within 30 seconds, there were 20 occurrences of 500 errors, 15 occurrences of 504 errors, and 40 occurrences of API responses exceeding 1 second for a certain domain interface.” This helps analyze the service status.
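
A rough sketch of this on an OpenResty gateway: the access-log format records the status code, response times, and a Trace ID header (here assumed to be `X-Trace-Id`), and a log-phase handler aggregates error and slow-request counters into a shared dict (declared as `lua_shared_dict api_stats 10m;`) that a periodic job can read and report:

```nginx
log_format monitor '$remote_addr [$time_local] "$request" '
                   '$status $request_time $upstream_response_time '
                   'trace_id=$http_x_trace_id';
access_log /var/log/nginx/access.log monitor;

log_by_lua_block {
    local stats  = ngx.shared.api_stats
    local status = tonumber(ngx.var.status)
    -- Bucket counters by 30-second window, URI, and status code.
    local window = math.floor(ngx.time() / 30)
    if status >= 500 then
        stats:incr(window .. "|" .. ngx.var.uri .. "|" .. status, 1, 0, 90)
    end
    if tonumber(ngx.var.request_time) > 1 then
        stats:incr(window .. "|" .. ngx.var.uri .. "|slow", 1, 0, 90)
    end
    -- A timer or an external job reads these counters and produces summaries like
    -- "20 x 500, 15 x 504, 40 responses over 1 second in the last 30 seconds".
}
```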

Unlike other monitoring approaches, gateway monitoring covers all businesses, but at a coarser granularity. It is still a good method, and if we combine it with tracing we can also log the Trace ID in the access logs, which lets us dig into the cause of a problem by Trace ID and makes troubleshooting more convenient. Similar implementations can be found at companies such as Good Future and Geek Time.

Summary #

In this lesson, I shared with you many clever uses of gateways: preventing intrusion, removing business dependencies, assisting smooth system upgrades, improving user experience, cushioning traffic spikes, and implementing service monitoring at a slightly coarser granularity.

I have drawn a mind map to summarize the key points for you, as shown below:

mind map

I believe that by now you appreciate the importance of gateways. In our systems the gateway plays a crucial role, and current technology trends prove the point: as the field develops, gateways are splitting into intranet and extranet variants whose functionality and development directions are diverging.

Here, I would like to focus on the development of intranet gateways. In recent years, microservices and Sidecar technology have become popular, and like intranet gateways, they address issues such as intranet traffic scheduling and high availability.

Of course, traditional intranet gateways are also evolving, and many excellent open-source projects have emerged, such as Kong, APISIX, and OpenResty. These gateways can support HTTP/2 long-lived bidirectional connections as well as RPC protocols.

The industry has been vigorously discussing whether to choose Sidecar Agent or use an intranet gateway. In my opinion, with the popularity of containerization, intranet gateways will undergo a new transformation. Services such as service discovery, service authorization, traffic scheduling, data caching, service high availability, and service monitoring will eventually be unified into a set of standards. If existing intranet gateways can reduce complexity, they will have an even greater advantage in the future.

Thought-provoking question #

Why do intranets often use gateways or implement service discovery instead of using intranet DNS services to achieve this functionality?

I look forward to interacting with you in the comments area, and I also recommend that you share this lesson with more colleagues and friends. See you in the next lesson!