
19 Traffic Dispatch: DNS, Full-Site Acceleration, and Data Center Load Balancing #

Hello, I am Xu Changlong.

In the previous lesson, we learned how to handle traffic pressure from an architectural design perspective. For services like live streaming, user traffic is hard to predict, and once it grows beyond what a single data center can handle, we need to dynamically dispatch a portion of users to other data centers.

At the same time, as traffic grows, so does the chance of network instability. Only by routing users to the nearest data center can we ensure a good user experience.

Taking into account all of the above considerations, in this lesson, we will focus on the key technologies of traffic dispatch and data distribution, helping you understand how to effectively switch traffic among multiple data centers.

Live streaming services mainly consist of two types of traffic: static file access and live streams. These can be distributed through CDNs to reduce the load on our servers.

For services like live streaming, with their heavy read and write demands, dynamic traffic dispatch and data caching/distribution are the foundation for supporting large numbers of users interacting online. However, these mechanisms overlap with DNS in functionality and must be implemented together, so CDN will come up repeatedly throughout the explanation.

DNS Domain Name Resolution and Caching #

Switching service traffic is not as simple as it sounds, because we run into a big problem: DNS caching. DNS is the first step of every request, and if it is slow or resolves incorrectly, it seriously degrades the interactive experience of read- and write-intensive systems.

So why is DNS slow to refresh? To answer that, we first need to understand the DNS resolution process. You can follow the analysis against the diagram below:

Image

When a client or browser initiates a request, the first service to be requested is DNS. The domain name resolution process can be divided into the following three steps:

  1. The client requests the DNS resolution service provided by its ISP, and the ISP's DNS service first queries the root DNS server.
  2. The root DNS server points it to the top-level domain (TLD) DNS server, for example the one for .org.
  3. The TLD server points it to the domain's authoritative DNS server.

After finding the authoritative DNS server, DNS will start resolving the domain name.
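
To make this concrete, here is a minimal sketch (assuming the third-party dnspython library, which the lesson itself does not require) that asks the resolver for a record and prints the cache TTL the upstream server suggests:

```python
# pip install dnspython
import dns.resolver

# Ask the system-configured resolver (typically the ISP's DNS service)
# for the A record of a domain.
answer = dns.resolver.resolve("example.org", "A")

# rrset.ttl is the suggested cache time in seconds; a caching resolver
# counts it down, so repeated queries show a shrinking value.
print("IPs:", [rr.address for rr in answer])
print("TTL:", answer.rrset.ttl, "seconds")
```

Running it twice in quick succession usually shows the second answer coming from the resolver's cache, with a smaller TTL.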

Generally, the authoritative DNS server is provided by the service where we host the domain, and the specific resolution rules and TTL are configured in that provider's management console.

When the main domain is resolved, the authoritative DNS server returns the entry IP of the server's data center together with a suggested cache TTL. At this point, the DNS resolution query is complete.

When the result reaches the ISP's DNS service, the ISP first caches it locally for the duration of the TTL before returning it to the client. Within that TTL, identical resolution requests are answered straight from the ISP's cache.

As you would expect, the client also caches the DNS result, and in practice many clients do not honor the TTL suggested by DNS, but prioritize their own configured cache time.

At the same time, every ISP along the way keeps its own cache. If we change the domain's resolution, clients will only pick up the update after at least the provider's own server refresh time (usually 3 minutes) plus the TTL.

In fact, the worst-case scenario is as follows:

```
// Network-wide domain resolution cache refresh time
Client local resolution cache time: 30 minutes
+ City-level ISP DNS cache time: 30 minutes
+ Provincial-level ISP DNS cache time: 30 minutes
+ Main DNS provider refreshing its resolution servers: 3 minutes
+ ... subsequent ISPs' subnet caches: ignored
= Actual update time for domain resolution: 93 minutes or more
```

For this reason, many domain resolution services suggest keeping the TTL under 30 minutes, and many large internet companies deliberately shorten the cache time on the client side. But if you set the TTL too short, refreshes are fast at the cost of very unstable service requests.

Of course, 93 minutes is the ideal case. In practice, after a normal domain change it takes about 48 hours for most DNS caches nationwide to update, and about 72 hours for the global cache. So unless absolutely necessary, do not change the main domain's resolution.

If you do need an urgent refresh, I suggest purchasing a forced-push service that refreshes the major ISPs' DNS caches. This service is expensive and only covers the trunk lines of major cities, and some areas may still refresh slowly (depending on the broadband provider), but overall it does speed up the DNS cache refresh.

Slow DNS refresh causes us plenty of trouble: if we relied on it for failover, the switch would take three days to complete, which is devastating for availability. Fortunately, several modern technologies address this problem, such as CDN, GTM, and HttpDNS. Let's look at them one by one.

CDN Website Acceleration #

You might wonder: what does speeding up DNS cache refresh have to do with CDN?

Before discussing how CDN acceleration is achieved, let's look at what a CDN is and how its website acceleration works. Website acceleration matters a great deal for read- and write-heavy systems. Generally speaking, common CDNs provide static file acceleration, as shown in the following diagram:

Image

When a user requests a resource from the CDN, the CDN edge first answers from its local cache of static resources.

If the CDN does not have the resource cached locally, or the resource is dynamic (e.g., an API), it fetches the content from our origin server, refreshing its local cache according to the timeout value we specify on the server side. This greatly reduces the pressure on our data services in the server room and saves a significant amount of bandwidth and hardware resources.
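
That "timeout value we specify on the server side" is usually just an HTTP caching header. Here is a minimal Flask sketch (the endpoints and values are hypothetical, purely for illustration) showing how an origin server could tell the CDN what it may cache and for how long:

```python
# pip install flask
from flask import Flask, jsonify

app = Flask(__name__)

# Static-style resource: allow the CDN edge to cache it for 10 minutes
# before fetching from the origin again.
@app.route("/banner.json")
def banner():
    resp = jsonify({"title": "tonight's stream"})
    resp.headers["Cache-Control"] = "public, max-age=600"
    return resp

# Dynamic API: mark it uncacheable so the CDN always forwards the
# request to our server.
@app.route("/api/balance")
def balance():
    resp = jsonify({"coins": 42})
    resp.headers["Cache-Control"] = "no-store"
    return resp

if __name__ == "__main__":
    app.run()
```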

In addition to accelerating static resources, a CDN also provides regional network acceleration services, as shown in the following diagram:

Image

CDN providers deploy acceleration data centers in major provinces and cities and connect them with high-speed dedicated lines.

When a client asks DNS to resolve a domain name, the DNS service in the client's province or city uses GSLB to return the IP of the nearest CDN data center. This greatly reduces the number of network hops between the user and the data center, speeds up the response, and lowers the chance of the request being hijacked in transit.
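
The core of GSLB is "answer with the nearest node". The toy sketch below shows only the idea; real GSLB implementations rely on IP geolocation databases, ISP line data, and live probes, and every table here is invented:

```python
# Toy GSLB: pick the CDN node "closest" to the client's region.
CDN_NODES = {
    "north": "203.0.113.10",  # e.g. a Beijing PoP
    "east":  "203.0.113.20",  # e.g. a Shanghai PoP
    "south": "203.0.113.30",  # e.g. a Guangzhou PoP
}

# Hypothetical mapping from client IP prefixes to regions.
REGION_OF_PREFIX = {"1.2.": "north", "5.6.": "east", "9.10.": "south"}

def gslb_resolve(client_ip: str) -> str:
    for prefix, region in REGION_OF_PREFIX.items():
        if client_ip.startswith(prefix):
            return CDN_NODES[region]
    return CDN_NODES["east"]  # fallback node when the region is unknown

print(gslb_resolve("1.2.3.4"))  # -> 203.0.113.10
```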

The path effect of a client requesting a service is shown in the following diagram:

Image

If a user requests a dynamic interface from a fully accelerated website, the CDN node forwards the request to our origin server over the CDN's internal network, along the shortest and fastest link.

Compared with having the client's request forwarded across multiple ISP networks from another province before it reaches the server, this approach does far better against slow networks and gives the client a much better experience.

After the website is fully accelerated, all user requests will be forwarded by CDN, and all domain names requested by the client will also point to CDN, which will then forward the requests to our server.

During this process, if a data center changes the IP address behind the CDN, we can use the CDN's internal DNS service (provided by the CDN vendor) to refresh the DNS cache inside the CDN. The client's own DNS resolution stays unchanged, and the refresh no longer requires waiting 48 hours.

Because cache refresh takes up to 48 hours, most internet companies do not fail over between data centers by changing the DNS configuration; instead they rely on CDN to perform a similar function. However, a failure at the CDN entry point then has a significant impact on website services.

To mitigate entry failures, companies abroad use anycast. With anycast, multiple data center entry points share the same IP address, so if one entry fails, traffic is redirected to another data center. However, for security reasons, anycast is not supported domestically in China.

Besides the risk of CDN entry failure, there is another problem: if the CDN cache misses and the local origin site has failed, the back-to-origin request cannot automatically switch to another data center. To improve availability, we can therefore place GTM behind the CDN.

GTM Global Traffic Management #

Before I explain how GTM works in combination with CDN, let me first tell you about the working principle and main features of GTM.

GTM stands for Global Traffic Management system. I have created a diagram to help you better understand:

Image

When a client requests a service domain, it first asks DNS to resolve it. When the resolution request reaches the main domain's DNS service, that service hands it to GTM for intelligent DNS resolution.

Compared to traditional technologies, GTM has three additional features: service health monitoring, multi-line optimization, and traffic load balancing.

First is service health monitoring. GTM monitors the working status of servers, and if it detects that a data center is unresponsive, it automatically switches traffic to a healthy data center. GTM also provides fault tolerance: based on each data center's capacity and weight, it can shift part of the user traffic to other data centers, as in the sketch below.
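
Here is a sketch of that health-monitoring loop, with hypothetical entry IPs and a /healthz probe path that I am assuming rather than quoting from any real GTM product:

```python
# pip install requests
import requests

# Hypothetical data center entry points; "weight" models capacity.
DATACENTERS = [
    {"name": "dc-beijing",  "ip": "198.51.100.1", "weight": 3},
    {"name": "dc-shanghai", "ip": "198.51.100.2", "weight": 1},
]

def healthy(dc: dict) -> bool:
    """Probe a health endpoint; treat timeouts and non-200s as failure."""
    try:
        r = requests.get(f"http://{dc['ip']}/healthz", timeout=2)
        return r.status_code == 200
    except requests.RequestException:
        return False

def resolve() -> list:
    """Answer DNS queries only with healthy data centers, so traffic
    automatically drains away from a dead one."""
    alive = [dc for dc in DATACENTERS if healthy(dc)]
    return alive or DATACENTERS  # never return an empty answer
```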

Second is multi-line optimization. In China there are several broadband providers (China Mobile, China Unicom, China Telecom, and education networks), and users get better performance when they reach a website through a gateway IP on their own provider's network; crossing providers adds forwarding latency. With GTM, we can choose a faster access path based on the provider lines available to each data center's CDN sources.

GTM also provides traffic load balancing: based on traffic monitoring and request latency, it distributes and intelligently schedules the clients' traffic.
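
Latency-based balancing can be as simple as weighting each data center inversely to its measured latency. A toy sketch, with invented numbers:

```python
import random

# Hypothetical per-data-center stats from GTM's traffic monitoring.
STATS = {
    "dc-beijing":  {"ip": "198.51.100.1", "latency_ms": 20},
    "dc-shanghai": {"ip": "198.51.100.2", "latency_ms": 60},
}

def pick() -> str:
    # Weight inversely to latency: the 20 ms DC receives about three
    # times the traffic of the 60 ms DC. A real GTM would also factor
    # in remaining capacity and cost.
    names = list(STATS)
    weights = [1.0 / STATS[n]["latency_ms"] for n in names]
    chosen = random.choices(names, weights=weights, k=1)[0]
    return STATS[chosen]["ip"]
```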

When GTM is combined with CDN website acceleration, the effect is even better. The specific combination is shown in the following diagram:

Image

Since both GTM and CDN acceleration redirect traffic via CNAME records, we can point the domain at the CDN first; the CDN serves clients through its GSLB and internal network, and when it needs to fetch from the origin, that back-to-origin request is resolved by GTM, which then distributes the CDN's traffic across our data centers for load balancing.
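
You can observe such a CNAME chain yourself. The sketch below (again assuming dnspython; the chain shape is illustrative, e.g. a site name pointing to a CDN vendor name pointing to a GTM name) follows CNAME records until it reaches the final name:

```python
# pip install dnspython
import dns.resolver

def cname_chain(name: str) -> list:
    """Follow CNAME redirections step by step, with a depth guard."""
    chain = [name]
    for _ in range(10):  # guard against CNAME loops
        try:
            ans = dns.resolver.resolve(chain[-1], "CNAME")
        except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
            break  # reached a name that resolves directly to an IP
        chain.append(str(ans[0].target))
    return chain

print(" -> ".join(cname_chain("www.example.org")))
```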

When our data center fails, GTM quickly removes the faulty data center from the load balancing list, which satisfies our network acceleration needs, achieves load balancing among multiple data centers, and enables faster failover.

However, even with CDN + GTM, some users may still experience slow access, because many ISPs' DNS services are imperfect: users can run into DNS pollution, man-in-the-middle attacks, and misdirected DNS resolution.

To mitigate these issues, we need to force HTTPS protocol for external services on top of existing services, and we also recommend enabling HttpDNS service in the client app in conjunction with GPS positioning.

HttpDNS Service #

An HttpDNS service bypasses the local ISP's DNS, preventing DNS hijacking and avoiding the DNS caching problem. It also provides GSLB (Global Server Load Balancing) functionality, and it allows custom resolution, enabling gray releases and A/B testing.

Generally speaking, HttpDNS can only solve the service scheduling issue on the app side. Therefore, if a client program uses HttpDNS service, a fallback plan should also be implemented to address the problem of domain resolution failure caused by HttpDNS service failure.

Here is a recommended fallback order for resolution: HttpDNS first, then a DNS service at a hard-coded IP, and finally the local ISP's DNS. This approach significantly improves the security of client-side DNS.
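
A minimal sketch of that fallback order; the HttpDNS URL and its JSON response shape are hypothetical, since every provider defines its own API:

```python
# pip install requests dnspython
import socket
import requests
import dns.resolver

def resolve_with_fallback(domain: str) -> str:
    # 1. HttpDNS: a plain HTTPS API, immune to ISP DNS hijacking.
    try:
        r = requests.get("https://httpdns.example.com/d",
                         params={"host": domain}, timeout=2)
        return r.json()["ips"][0]          # hypothetical response shape
    except (requests.RequestException, KeyError, IndexError, ValueError):
        pass
    # 2. DNS against a hard-coded resolver IP, bypassing the ISP default.
    try:
        resolver = dns.resolver.Resolver()
        resolver.nameservers = ["223.5.5.5"]  # e.g. a public resolver
        return resolver.resolve(domain, "A")[0].address
    except Exception:
        pass
    # 3. Last resort: the local ISP's DNS via the system resolver.
    return socket.gethostbyname(domain)
```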

Of course, we can also enable DNSSEC to further harden the DNS service, but all of the services above have to be weighed against our actual budget and resources.

However, HttpDNS is not free, and it can be quite costly for large enterprises, as many HttpDNS providers charge per request.

Therefore, to save costs, we try to minimize the number of requests. A common approach in the app is to cache HttpDNS results keyed by the client's network IP and hotspot/network name (Wi-Fi, 5G, 4G).
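
A sketch of such a client-side cache, keyed the way the text suggests; the 5-minute TTL is an assumption to tune against cost versus freshness:

```python
import time

# Cache HttpDNS answers keyed by (domain, egress IP, network name):
# the right answer changes when the user moves from Wi-Fi to 4G/5G.
_cache: dict = {}
TTL = 300  # seconds

def cached_resolve(domain, egress_ip, network_name, do_httpdns):
    key = (domain, egress_ip, network_name)
    hit = _cache.get(key)
    if hit and time.time() - hit[1] < TTL:
        return hit[0]                # saved one billable HttpDNS request
    ip = do_httpdns(domain)          # the paid HttpDNS call
    _cache[key] = (ip, time.time())
    return ip
```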

Business Self-implemented Traffic Scheduling #

An HttpDNS service only solves the DNS pollution problem; it cannot take part in our business scheduling, so its support is limited when we need to control and schedule traffic based on business needs.

To improve user experience, internet companies have taken the ideas behind HttpDNS and built their own traffic scheduling. For example, many live streaming services, whose user traffic is hard to control, have implemented HttpDNS-like scheduling services. The common approach is for the client to request the scheduling service, which assigns it to a nearby data center.

This scheduling service also enables data center failover. If a server cluster fails, client requests to that data center will fail, buffer, or lag; the client then proactively re-requests the scheduling service, and if the scheduler has received an instruction to switch data centers, it returns the IP of a healthy data center, improving service availability.
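
From the client's side, the interaction might look like the sketch below; the scheduler URLs and response fields are invented for illustration:

```python
# pip install requests
import requests

# Hypothetical scheduling service deployed in multiple data centers.
SCHEDULERS = ["https://sched-a.example.com", "https://sched-b.example.com"]

def get_assigned_dc(user_id: str) -> str:
    """Ask the scheduling service which data center to use;
    if one scheduler replica is down, try the next."""
    for base in SCHEDULERS:
        try:
            r = requests.get(f"{base}/assign",
                             params={"uid": user_id}, timeout=2)
            return r.json()["dc_ip"]   # hypothetical response shape
        except requests.RequestException:
            continue
    raise RuntimeError("all scheduling services unreachable")
```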

The scheduling service itself must also be highly available. Specifically, it is deployed across multiple data centers, and the scheduler replicas synchronize their user scheduling results through Raft consensus.

Let me give you an example. If a user requests scheduling from data center A and is assigned to the Beijing data center, then shortly afterwards requests scheduling from data center B, they will still be assigned to Beijing. Only when the client switches networks or our data center fails does the traffic assignment change, and then uniformly.
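
The server side of that stickiness is conceptually a replicated assignment table. In the sketch below a plain dict stands in for the Raft-replicated state, which is of course the hard part in production:

```python
# user_id -> data center; in production this table is replicated
# across scheduler instances via Raft so every replica answers alike.
ASSIGNMENTS: dict = {}

def assign(user_id: str, healthy_dcs: set, nearest_dc: str) -> str:
    dc = ASSIGNMENTS.get(user_id)
    if dc in healthy_dcs:
        return dc                      # sticky: same answer as before
    ASSIGNMENTS[user_id] = nearest_dc  # first visit, or the old DC failed
    return nearest_dc
```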

To give clients the best experience, we need to assign each one to the nearest data center with the best response performance. For this, the scheduling service needs auxiliary data to support its decisions: IP, GPS location, internet service provider, ping latency, and actual playback quality.

The client collects this data periodically and uploads it to the big data platform for analysis; the resulting recommendations help the scheduling service decide which data center, and which network route, each client should use.
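
A sketch of what such a periodic quality report could look like; the endpoint and field names are made up, but the fields mirror the auxiliary data listed above:

```python
# pip install requests
import time
import requests

def report_quality(api_url: str, sample: dict) -> None:
    """Upload one network-quality sample for the big data platform."""
    payload = {
        "ts": int(time.time()),
        "client_ip": sample["ip"],
        "gps": sample["gps"],                  # e.g. (39.9, 116.4)
        "isp": sample["isp"],                  # e.g. "china-unicom"
        "ping_ms": sample["ping_ms"],          # measured round-trip time
        "stall_count": sample["stall_count"],  # actual playback effect
    }
    requests.post(api_url, json=payload, timeout=2)
```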

In effect, this is a self-implemented GSLB (Global Server Load Balancing). Note, however, that the data behind a self-built GSLB is never absolutely correct, because DNS resolution results differ across provinces and cities. Moreover, if the client cannot establish a connection, it must try the recommended IPs one by one to keep availability high.

In addition, to verify scheduling stability, we can have the client store its scheduling result temporarily and carry it in a header on every request. The server can then detect clients that mistakenly send requests to the wrong data center.

If misrouted requests are found, the data center gateway can reverse-proxy them, much like CDN full-site acceleration does, keeping the client stable.
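
On the server side, detecting misrouted clients can be a small gateway check. A Flask sketch, with a hypothetical X-Assigned-DC header name and data center ID:

```python
# pip install flask
from flask import Flask, request

app = Flask(__name__)
LOCAL_DC = "dc-beijing"  # hypothetical identity of this data center

@app.before_request
def check_dispatch():
    # The client echoes back which DC the scheduler assigned it to.
    assigned = request.headers.get("X-Assigned-DC")
    if assigned and assigned != LOCAL_DC:
        # Misrouted request: log it for monitoring. A real gateway
        # would also reverse-proxy it to the assigned data center
        # instead of letting the client fail.
        app.logger.warning("misrouted request: assigned=%s, local=%s",
                           assigned, LOCAL_DC)
```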

Live streaming and video need similar scheduling. If significant buffering is observed during playback, the client should automatically switch the video source and report the situation to the big data platform for recording and analysis. If widespread buffering is detected, the platform sends alerts to our operations and R&D colleagues.
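
Client-side source switching can be driven by a simple stall counter; the one-minute window and three-stall threshold below are guesses for the sketch, not values from the lesson:

```python
import time

def maybe_switch_source(stall_times, sources, current, report):
    """Switch to the next candidate stream source if playback stalled
    too often recently, and report the incident for analysis."""
    recent = [t for t in stall_times if t > time.time() - 60]
    if len(recent) >= 3:  # threshold chosen arbitrarily for the sketch
        nxt = sources[(sources.index(current) + 1) % len(sources)]
        report({"event": "source_switch", "from": current,
                "to": nxt, "stalls": len(recent)})
        return nxt
    return current
```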

Summary #

Image

A domain name is the primary entrance to our services. A request must first resolve the domain into an IP address through DNS. Because frequent DNS requests would hurt response speed, many clients and ISPs cache DNS results, but this multi-level caching makes domain resolution hard to refresh.

Even if we pay to refresh the caches of multiple bandwidth providers, in some regions we still have to wait at least 48 hours for most users' caches to refresh.

If we have to switch IPs due to a website failure or other special reasons, the impact is catastrophic. Fortunately, in recent years we have been able to improve the traffic scheduling of our multiple data centers with CDN, GTM, and HttpDNS.

However, CDN and GTM focus on data center scheduling and are transparent to the business. So in high-concurrency scenarios where user experience matters most, we implement our own scheduling system.

In such a self-implemented solution, you will find the ideas are similar to HttpDNS and GSLB. The difference is that those are generic infrastructure services, while our own service can also schedule user traffic quickly according to business needs.

Using the HttpDNS approach to switch users between data centers and video streams is undoubtedly convenient and simple: just change the IP inside the app's request wrapper, and the business gets seamless data center switching.

Reflection Questions #

How can long connections, such as video streams and WebSocket connections, dynamically switch data centers?

Welcome to leave your comments in the discussion area and let’s meet again in the next class!