37 Case Study DNS Resolves Sometimes Fast Sometimes Slow, What Should I Do

Hello, I am Ni Pengfei.

In the previous section, I introduced you to the methods for evaluating network performance. As a brief recap, the Linux network stack is built on the TCP/IP protocol suite, and the performance metrics we care about differ from layer to layer.

At the application layer, we focus on metrics such as concurrent connections, requests per second, processing latency, error counts, etc. Tools like wrk and JMeter can be used to simulate user loads and obtain the desired test results.

At the transport layer, we are concerned about the performance of transport layer protocols such as TCP and UDP, including metrics like TCP connections, TCP retransmission, TCP errors, etc. For this, you can use tools like iperf and netperf to test the performance of TCP or UDP.

Moving down to the network layer, we focus on the handling capacity of network packets, known as PPS (packets per second). The pktgen tool that comes with the Linux kernel can help you test this metric.

Since lower-layer protocols form the foundation of higher-layer protocols, network optimization generally includes optimization across all layers of the network protocol stack. However, the specific positions and targets for optimization may vary depending on different performance requirements.

In the previous section, when evaluating network performance (such as HTTP performance), we specified the IP address of the network service in the testing tools. An IP address is an important identifier used in the TCP/IP protocol to determine the communication parties. Each IP address consists of a host number and a network number. Hosts with the same network number form a subnet, and different subnets are connected through routers to form a large network.

However, while IP addresses make communication between machines convenient, they are hard for the people accessing these services to remember. I believe that very few people can remember the IP address of GitHub, because this string of digits has no meaning to us and is not the kind of thing our memory holds on to.

However, this does not prevent us from using this service frequently. Why? It is because there is a simpler and more convenient way. We can access it through the domain name github.com instead of relying on a specific IP address, which is the purpose of the Domain Name System (DNS).

DNS (Domain Name System) is the most fundamental service in the Internet, providing mainly query services for mapping between domain names and IP addresses.

DNS not only makes it convenient for people to access different Internet services, but also provides mechanisms for dynamic service discovery and global server load balancing (GSLB) for many applications. With these mechanisms, DNS can return the IP closest to the user to provide the service, and even if the IP addresses of backend services change, users can still access them using the same domain name.

DNS is obviously a fundamental and important aspect of our work. So, how do we analyze and troubleshoot DNS problems? Today, I’m going to show you how to do it.

Domain Names and DNS Resolution #

We are all familiar with domain names, which are composed of a series of characters separated by dots and used to identify a specific computer or group of computers on the internet. The purpose of domain names is to make it easier for people to recognize and identify the location of the various hosts providing services on the internet.

It is important to note that domain names are globally unique and can only be registered through specialized domain registrars. In order to organize the numerous computers on the global internet, domain names are structured hierarchically, with each segment separated by dots forming a different level in the hierarchy. The position of a string separated by dots determines its hierarchy level, with higher levels located towards the end.

Let’s use the example of the GeekTime website time.geekbang.org to understand the meaning of a domain name. In this string, “org” is the top-level domain, “geekbang” is the second-level domain, and “time” is the third-level domain.

As shown in the following diagram, the dot (.) represents the root of all domain names. In other words, every fully qualified domain name ends with a dot. The trailing dot is usually omitted when we type a name, but it is implied during DNS resolution.
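To make the hierarchy concrete, here is a minimal sketch that prints each level of a domain name, from the root down to the full name. The function name domain_levels is my own choice for illustration and is not a standard tool.

```shell
# Print each level of a domain name, from the root down to the full name.
# domain_levels is a hypothetical helper name, not a standard tool.
domain_levels() {
    # Strip one trailing dot, if present, so "a.b." and "a.b" behave the same.
    name=${1%.}
    echo "$name" | awk -F. '{
        print "."                      # the root of all domain names
        suffix = ""
        for (i = NF; i >= 1; i--) {    # walk the labels from right to left
            suffix = $i "." suffix
            print suffix
        }
    }'
}

domain_levels time.geekbang.org
```

For time.geekbang.org, this prints four lines: the root (.), then org., geekbang.org., and finally time.geekbang.org., each written with its trailing dot.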

By understanding these concepts, you can see that domain names are primarily used for ease of human recognition, while IP addresses are the actual mechanism for communication between machines. The service that translates domain names to IP addresses is called domain name resolution, or DNS. The corresponding server is the domain name server, and the network protocol used is the DNS protocol.

It is important to note that the DNS protocol belongs to the application layer in the TCP/IP stack. However, the actual transmission is based on UDP or TCP protocols (mostly UDP), and domain name servers generally listen on port 53.

Since domain names are managed in a hierarchical structure, domain name resolution is also performed recursively, starting from the root and working down through the top-level domain and each lower level, until the resolution result is obtained.

However, there is no need to worry about performing the recursive query yourself, as the DNS server will handle it for you. All you need to do is configure a usable DNS server in advance.

Of course, we know that generally, each level of DNS server will have a cache of recently resolved records. When a cache hit occurs, the response can be directly obtained from the cache. If the cache expires or does not exist, the recursive query process mentioned earlier is necessary.

Therefore, when configuring the network on a Linux system, the system administrator needs to configure not only the IP address but also the DNS server, so that the system can access external services using domain names.

For example, my system is configured with the domain name server 114.114.114.114. You can execute the following command to query your system configuration:

$ cat /etc/resolv.conf
nameserver 114.114.114.114

In addition, the DNS service manages all data through resource records. It supports various types of records such as A, CNAME, MX, NS, and PTR. For example:

  • The A record is used to convert a domain name to an IP address.
  • The CNAME record is used to create an alias.
  • The NS record represents the domain name server address corresponding to the domain name.

In summary, when we access a website, we need to use the A record of DNS to query the IP address corresponding to the domain name, and then access the web service using that IP address.

Taking the website time.geekbang.org as an example again, executing the nslookup command below queries the A record of this domain name. You can see that its IP address is 39.106.233.176:

$ nslookup time.geekbang.org
# Domain name server and port information
Server:		114.114.114.114
Address:	114.114.114.114#53

# Non-authoritative query result
Non-authoritative answer:
Name:	time.geekbang.org
Address: 39.106.233.176

It is important to note that since 114.114.114.114 is not the domain name server directly managing time.geekbang.org, the query result is not authoritative. Using the above command, you can only obtain the result of the query from 114.114.114.114.

Earlier it was mentioned that if the cache is not hit, DNS resolution is actually a recursive process. Is there a way to know the entire process of recursive querying?

In addition to nslookup, another commonly used DNS resolution tool, dig, provides the trace function to display the entire process of recursive querying. For example, you can execute the command below to obtain the query result:

# The +trace option enables trace queries
# The +nodnssec option disables DNSSEC
$ dig +trace +nodnssec time.geekbang.org

; <<>> DiG 9.11.3-1ubuntu1.3-Ubuntu <<>> +trace +nodnssec time.geekbang.org
;; global options: +cmd
.           322086  IN  NS  m.root-servers.net.
.           322086  IN  NS  a.root-servers.net.
.           322086  IN  NS  i.root-servers.net.
.           322086  IN  NS  d.root-servers.net.
.           322086  IN  NS  g.root-servers.net.
.           322086  IN  NS  l.root-servers.net.
.           322086  IN  NS  c.root-servers.net.
.           322086  IN  NS  b.root-servers.net.
.           322086  IN  NS  h.root-servers.net.
.           322086  IN  NS  e.root-servers.net.
.           322086  IN  NS  k.root-servers.net.
.           322086  IN  NS  j.root-servers.net.
.           322086  IN  NS  f.root-servers.net.
;; Received 239 bytes from 114.114.114.114#53(114.114.114.114) in 1340 ms

org.        172800  IN  NS  a0.org.afilias-nst.info.
org.        172800  IN  NS  a2.org.afilias-nst.info.
org.        172800  IN  NS  b0.org.afilias-nst.org.
org.        172800  IN  NS  b2.org.afilias-nst.org.
org.        172800  IN  NS  c0.org.afilias-nst.info.
org.        172800  IN  NS  d0.org.afilias-nst.org.
;; Received 448 bytes from 198.97.190.53#53(h.root-servers.net) in 708 ms

geekbang.org.       86400   IN  NS  dns9.hichina.com.
geekbang.org.       86400   IN  NS  dns10.hichina.com.
;; Received 96 bytes from 199.19.54.1#53(b0.org.afilias-nst.org) in 1833 ms

time.geekbang.org. 600 IN  A  39.106.233.176
;; Received 62 bytes from 140.205.41.16#53(dns10.hichina.com) in 4 ms

The output of dig trace mainly includes four parts.

  • The first part is the NS record of some root domain servers (.) found from 114.114.114.114.

  • The second part is the NS record of the top-level domain org. obtained by selecting one (h.root-servers.net) from the NS record results.

  • The third part is the NS server of the second-level domain geekbang.org. obtained by selecting one (b0.org.afilias-nst.org) from the NS record of org.

  • The last part is the A record of the final host time.geekbang.org obtained by querying the NS server (dns10.hichina.com) of geekbang.org.

The NS records of these domain levels displayed in this output are actually the addresses of their respective domain servers, which can help you understand the process of DNS resolution more clearly. In order to help you understand recursive query more intuitively, I have organized this process into a flowchart that you can save and refer to.
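When the trace output is long, it can help to reduce it to one line per step. The following is a rough sketch that reads dig +trace output on standard input and prints which server answered each step and how long it took. The helper name trace_hops is hypothetical, and the parsing assumes the ";; Received ... from ... in N ms" line format shown above.

```shell
# Summarize dig +trace output: one line per answering server, plus a total.
# trace_hops is a hypothetical helper; it relies on the ";; Received" lines
# having the same layout as in the output shown above.
trace_hops() {
    awk '$1 == ";;" && $2 == "Received" {
        server = $6
        sub(/^.*\(/, "", server)    # keep only the name inside the parentheses
        sub(/\)$/, "", server)
        total += $8
        printf "%s %d ms\n", server, $8
    }
    END { printf "total %d ms\n", total }'
}

# Real usage would be: dig +trace +nodnssec time.geekbang.org | trace_hops
# Here the four "Received" lines from the output above are fed in directly.
trace_hops <<'EOF'
;; Received 239 bytes from 114.114.114.114#53(114.114.114.114) in 1340 ms
;; Received 448 bytes from 198.97.190.53#53(h.root-servers.net) in 708 ms
;; Received 96 bytes from 199.19.54.1#53(b0.org.afilias-nst.org) in 1833 ms
;; Received 62 bytes from 140.205.41.16#53(dns10.hichina.com) in 4 ms
EOF
```

With the sample lines, the total comes to 3885 ms, almost all of it spent on the root, org. and local resolver steps rather than on the authoritative server for geekbang.org.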

Of course, not only services published to the Internet need domain names, but often we also want to resolve the hostnames (i.e., internal domain names) of hosts within the local area network. Linux also supports this behavior.

So, you can write the mapping between a hostname and its IP address into the /etc/hosts file on your local machine. The specified hostname can then be resolved to the target IP locally. For example, you can view the current mappings with the following command:

$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain
::1         localhost6 localhost6.localdomain6
192.168.0.100 domain.com
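To show how such a file maps names to addresses, here is a small sketch that looks up a hostname in a hosts-format file. The function name hosts_lookup is hypothetical, and the demo uses a scratch copy of the entries above instead of touching the real /etc/hosts.

```shell
# Look up a hostname in a hosts-format file (default: /etc/hosts).
# hosts_lookup is a hypothetical helper name used only for illustration.
hosts_lookup() {
    awk -v host="$1" '
        { sub(/#.*/, "") }    # drop comments before matching
        { for (i = 2; i <= NF; i++) if ($i == host) { print $1; exit } }
    ' "${2:-/etc/hosts}"
}

# Demo against a scratch file holding the same entries as above.
demo=$(mktemp)
cat > "$demo" <<'EOF'
127.0.0.1   localhost localhost.localdomain
::1         localhost6 localhost6.localdomain6
192.168.0.100 domain.com
EOF
hosts_lookup domain.com "$demo"
rm -f "$demo"
```

The demo prints 192.168.0.100. On a real system, getent hosts domain.com is the usual way to ask the resolver itself, since it follows the lookup order configured on the machine rather than reading one file.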

Alternatively, within an intranet, you can build a custom DNS server specifically for resolving domain names within the intranet. The intranet DNS server generally also sets one or more upstream DNS servers to resolve domain names in the Internet.

After understanding the basic principles of domain names and DNS resolution, I will show you a few cases to analyze how to locate and troubleshoot DNS resolution issues.

Case Preparation #

This case is still based on Ubuntu 18.04, and is also applicable to other Linux systems. The environment I used for this case is as follows:

  • Machine configuration: 2 CPUs, 8GB memory.

  • Docker and other tools installed in advance, for example with apt install docker.io.

You can open a terminal first, SSH to the Ubuntu machine, and then execute the following command to pull the Docker image used in the case:

$ docker pull feisky/dnsutils
Using default tag: latest
...
Status: Downloaded newer image for feisky/dnsutils:latest

Then, run the following command to view the DNS servers currently configured on the host:

$ cat /etc/resolv.conf
nameserver 114.114.114.114

As you can see, the DNS server configured on my host is 114.114.114.114.

With this, the preparation work is complete. Next, we will proceed to the formal operation.

Case Study #

Case 1: DNS Resolution Failure #

First, run the following command to enter the first case of today. If everything is normal, you will see the following output:

# Enter the SHELL terminal of the case environment
$ docker run -it --rm -v $(mktemp):/etc/resolv.conf feisky/dnsutils bash
root@7e9ed6ed4974:/#

Note that the ID prefix after root here, 7e9ed6ed4974, is the ID of the container generated by Docker, which may be different in your environment, so you can ignore it.

Note: The commands with /# at the beginning in the code blocks indicate commands running inside the container.

Next, continue in the container terminal and execute the DNS query command. We will still query the IP address of time.geekbang.org:

/# nslookup time.geekbang.org
;; connection timed out; no servers could be reached

You can see that after a long delay, this command still failed and reported the errors “connection timed out” and “no servers could be reached”.

Seeing this, your first reaction is probably that the network is not working. But is that really the case? Let’s test it using the ping tool. Run the following command to test the connectivity from the container to 114.114.114.114:

/# ping -c3 114.114.114.114
PING 114.114.114.114 (114.114.114.114): 56 data bytes
64 bytes from 114.114.114.114: icmp_seq=0 ttl=56 time=31.116 ms
64 bytes from 114.114.114.114: icmp_seq=1 ttl=60 time=31.245 ms
64 bytes from 114.114.114.114: icmp_seq=2 ttl=68 time=31.128 ms
--- 114.114.114.114 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 31.116/31.163/31.245/0.058 ms

In this output, you can see that the network is accessible. So how can we find out the reason for the failure of the nslookup command? There are actually many methods, and the simplest one is to enable the debug output of nslookup to view the detailed steps during the query process and check if there are any abnormalities.

For example, we can continue in the container terminal and execute the following command:

/# nslookup -debug time.geekbang.org
;; Connection to 127.0.0.1#53(127.0.0.1) for time.geekbang.org failed: connection refused.
;; Connection to ::1#53(::1) for time.geekbang.org failed: address not available.

From this output, we can see that nslookup failed to connect to the loopback address (127.0.0.1 and ::1) on port 53. There is a problem here. Why did it try to connect to the loopback address instead of the 114.114.114.114 we saw earlier?

You may have already figured out the cause—there may be no DNS server configured in the container. Let’s confirm this by executing the following command:

/# cat /etc/resolv.conf

Sure enough, this command has no output, indicating that there is indeed no DNS server configured in the container. The solution is now clear: configure a DNS server in the /etc/resolv.conf file.
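The check we just did by eye can also be scripted. The sketch below lists the nameserver entries in a resolv.conf-style file and reports when there are none. The function name check_nameservers is hypothetical, and the demo uses an empty scratch file to imitate the situation in the case container.

```shell
# List the nameservers in a resolv.conf-style file, or report that none exist.
# check_nameservers is a hypothetical helper; the file defaults to /etc/resolv.conf.
check_nameservers() {
    file=${1:-/etc/resolv.conf}
    servers=$(awk '$1 == "nameserver" { print $2 }' "$file" 2>/dev/null)
    if [ -z "$servers" ]; then
        echo "no nameserver configured in $file"
        return 1
    fi
    echo "$servers"
}

# An empty file reproduces what we found in the container.
empty=$(mktemp)
check_nameservers "$empty" || echo "fix: add a nameserver line"
rm -f "$empty"
```

A non-zero return value makes the function easy to use in a startup or health-check script, so that an empty resolver configuration is caught before applications start timing out.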

You can execute the following command to configure the DNS server and then re-execute the nslookup command. This time, the domain name is resolved correctly:

/# echo "nameserver 114.114.114.114" > /etc/resolv.conf
/# nslookup time.geekbang.org
Server:		114.114.114.114
Address:	114.114.114.114#53

Non-authoritative answer:
Name:		time.geekbang.org
Address: 39.106.233.176

With that, the first case is easily solved. Finally, execute the exit command in the terminal to exit the container, and Docker will automatically clean up the container that was just run.

Case 2: Unstable DNS Resolution #

Next, let’s take a look at the second case. Run the following command to start a new container and enter its terminal:

$ docker run -it --rm --cap-add=NET_ADMIN --dns 8.8.8.8 feisky/dnsutils bash
root@0cd3ee0c8ecb:/#

Then, just like the previous case, run the nslookup command to resolve the IP address of time.geekbang.org. However, this time, add a time command to output the time used for resolution. If everything is normal, you may see the following output:

/# time nslookup time.geekbang.org
Server:		8.8.8.8
Address:	8.8.8.8#53

Non-authoritative answer:
Name:		time.geekbang.org
Address: 39.106.233.176

real	0m10.349s
user	0m0.004s
sys	0m0.0

You can see that the resolution is very slow this time, taking 10 seconds. If you run the above nslookup command multiple times, you may occasionally encounter errors like this:

/# time nslookup time.geekbang.org
;; connection timed out; no servers could be reached

real	0m15.011s
user	0m0.006s
sys	0m0.006s

In other words, similar to the previous case, resolution failure may also occur. Taken together, the current DNS resolution results are not only relatively slow but also occasionally result in timeouts.
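Since the symptom only shows up across several runs, it is convenient to repeat the query in a loop and record the time of each run. Below is a minimal sketch of such a loop. The helper name repeat_timed is hypothetical, and it assumes GNU date (for the %N nanosecond format), as found on Ubuntu.

```shell
# Run a command N times and print how long each run took, in milliseconds.
# repeat_timed is a hypothetical helper; it assumes GNU date (%N support).
repeat_timed() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        start=$(date +%s%N)
        # Record whether the command succeeded, so timeouts stand out.
        "$@" > /dev/null 2>&1 && status=ok || status=failed
        end=$(date +%s%N)
        echo "run $i: $(( (end - start) / 1000000 )) ms ($status)"
        i=$((i + 1))
    done
}

# In the case container this would be: repeat_timed 5 nslookup time.geekbang.org
# "true" is used here only so that the sketch runs without network access.
repeat_timed 3 true
```

Each output line shows the run number, the elapsed time, and whether the command succeeded, which makes both the slow runs and the occasional failures easy to spot.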

Why is this happening? How should we handle this problem?

Actually, as explained earlier, DNS resolution is essentially an exchange between a client and a server, and this exchange mostly uses the UDP protocol.

Therefore, for the entire process, there are many possible scenarios for unstable resolution results. For example:

  • The DNS server itself has issues, such as slow and unstable responses.
  • Or, the network latency between the client and the DNS server is relatively high.
  • Or, DNS request or response packets are occasionally dropped by network devices in certain situations.

Based on the output of the nslookup command above, the DNS server that the client is using is 8.8.8.8, the public DNS service provided by Google. Google’s service is generally reliable, so a problem with the DNS server itself is unlikely, and we can tentatively rule it out. Is it possible that there is a large delay between the local machine and the DNS server?

As mentioned earlier, ping can be used to test the latency of a server. For example, you can run the following command:

/# ping -c3 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=31 time=137.637 ms
64 bytes from 8.8.8.8: icmp_seq=1 ttl=31 time=144.743 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=31 time=138.576 ms
--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 137.637/140.319/144.743/3.152 ms

From the output of ping, we can see that the latency here has reached 140ms, which can explain why the resolution is so slow. In fact, if you run the above ping test multiple times, you will also see occasional packet loss.

/# ping -c3 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=30 time=134.032 ms
64 bytes from 8.8.8.8: icmp_seq=1 ttl=30 time=431.458 ms
--- 8.8.8.8 ping statistics ---
3 packets transmitted, 2 packets received, 33% packet loss
round-trip min/avg/max/stddev = 134.032/282.745/431.458/148.713 ms

This further explains why nslookup occasionally fails, because of packet loss in the network.
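If you compare several DNS servers this way, reading the two summary lines by hand gets tedious. As a rough sketch, the function below pulls the packet loss and the average round-trip time out of ping output. The name ping_summary is hypothetical, and the parsing assumes summary lines containing "packet loss" and "min/avg/max", as in the output above.

```shell
# Extract packet loss and average RTT from ping output read on stdin.
# ping_summary is a hypothetical helper; it assumes summary lines that
# contain "packet loss" and "min/avg/max", as in the output above.
ping_summary() {
    awk '
        /packet loss/ {
            for (i = 1; i <= NF; i++) if ($i ~ /%$/) loss = $i
        }
        /min\/avg\/max/ {
            split($0, parts, " = ")     # values follow the equals sign
            split(parts[2], rtt, "/")   # min / avg / max / deviation
            avg = rtt[2]
        }
        END { print "loss=" loss " avg_ms=" avg }
    '
}

# Real usage would be: ping -c3 8.8.8.8 | ping_summary
# Here the summary lines from the run above are fed in directly.
ping_summary <<'EOF'
3 packets transmitted, 2 packets received, 33% packet loss
round-trip min/avg/max/stddev = 134.032/282.745/431.458/148.713 ms
EOF
```

For the sample input this prints loss=33% avg_ms=282.745, the two numbers that matter when deciding whether a DNS server is too far away or too lossy.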

What should we do when we encounter such a problem? Obviously, since the latency is too high, we should switch to a DNS server with lower latency, such as 114.114.114.114, a public DNS service widely used in mainland China.

Before configuring, we can first use ping to test whether its latency is really better than 8.8.8.8. By executing the following command, you can see that its latency is only 31ms:

/# ping -c3 114.114.114.114
PING 114.114.114.114 (114.114.114.114): 56 data bytes
64 bytes from 114.114.114.114: icmp_seq=0 ttl=67 time=31.130 ms
64 bytes from 114.114.114.114: icmp_seq=1 ttl=56 time=31.302 ms
64 bytes from 114.114.114.114: icmp_seq=2 ttl=56 time=31.250 ms
--- 114.114.114.114 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 31.130/31.227/31.302/0.072 ms

This result shows that the latency is indeed much lower. Let’s continue executing the following command to change the DNS server and then execute the nslookup command again:

/# echo nameserver 114.114.114.114 > /etc/resolv.conf
/# time nslookup time.geekbang.org
Server:		114.114.114.114
Address:	114.114.114.114#53

Non-authoritative answer:
Name:	time.geekbang.org
Address: 39.106.233.176

real    0m0.064s
user    0m0.007s
sys     0m0.006s

You can see that now the resolution only takes 64ms, which is much better than the previous 10s.

So far, the problem seems to be solved. However, if you run the nslookup command multiple times, you probably won’t get good results every time. For example, on my machine, it often takes 1s or even longer.

/# time nslookup time.geekbang.org
Server:		114.114.114.114
Address:	114.114.114.114#53

Non-authoritative answer:
Name:	time.geekbang.org
Address: 39.106.233.176

real    0m1.045s
user    0m0.007s
sys     0m0.004s

A 1s DNS resolution time is still too long and unacceptable for many applications. So, how to solve this problem? I guess you have already thought of using DNS caching. This way, the DNS server will only be requested for the first query, and subsequent queries will use the records in the cache as long as the DNS records have not expired.

However, please note that most mainstream Linux distributions, with exceptions such as recent Ubuntu releases (18.04 and later), do not configure a DNS cache by default.

So, to enable DNS caching for the system, you need to do additional configuration. For example, the simplest method is to use dnsmasq.

dnsmasq is one of the most commonly used DNS caching services and is also often used as a DHCP server. It is easy to install and configure, and its performance can meet the needs of most applications for DNS caching.

Let’s continue in the previous container terminal and execute the following command to start dnsmasq:

/# /etc/init.d/dnsmasq start
 * Starting DNS forwarder and DHCP server dnsmasq                    [ OK ]

Then, modify /etc/resolv.conf and change the DNS server to the listening address of dnsmasq, which is 127.0.0.1. Then, re-execute the nslookup command multiple times:

/# echo nameserver 127.0.0.1 > /etc/resolv.conf
/# time nslookup time.geekbang.org
Server:		127.0.0.1
Address:	127.0.0.1#53

Non-authoritative answer:
Name:	time.geekbang.org
Address: 39.106.233.176

real	0m0.492s
user	0m0.007s
sys	0m0.006s

/# time nslookup time.geekbang.org
Server:		127.0.0.1
Address:	127.0.0.1#53

Non-authoritative answer:
Name:	time.geekbang.org
Address: 39.106.233.176

real	0m0.011s
user	0m0.008s
sys	0m0.003s

Now you can see that only the first resolution is slow, taking about 0.5s. Each subsequent resolution is fast, taking only about 11ms, and the resolution time stays stable afterwards.
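The case image already ships with a working dnsmasq setup, so no configuration was needed here. If you set dnsmasq up yourself, a configuration along the following lines is one possible starting point. This is only a sketch: the scratch file path, the upstream server and the cache size are my own illustrative assumptions, not the settings used in the case image.

```shell
# Write a minimal dnsmasq configuration to a scratch file.
# The path, upstream server and cache size are illustrative assumptions.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# Listen only on the loopback address
listen-address=127.0.0.1
# Do not read upstream servers from /etc/resolv.conf
no-resolv
# Forward cache misses to this upstream server
server=114.114.114.114
# Number of names to keep in the cache
cache-size=1000
EOF

# Show the effective (non-comment) settings.
grep -v '^#' "$conf"
```

The no-resolv and server lines matter in a setup like this one: once /etc/resolv.conf points at 127.0.0.1, dnsmasq can no longer learn its upstream servers from that file, so they are listed explicitly. A file like this could then be loaded with dnsmasq --conf-file=<path>.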

At the end of the case, don’t forget to execute the “exit” command to exit the container terminal. Docker will automatically clean up the case container.

Conclusion #

Today, I took you through the basic principles of DNS and used several cases to help you master the analysis and problem-solving approach when encountering DNS resolution issues.

DNS is the most fundamental service on the Internet, providing a query service for the mapping between domain names and IP addresses. Many applications do not take DNS resolution into account during initial development, and when problems appear later, it can take days of troubleshooting to discover that slow DNS resolution is the cause.

Imagine if a web service interface needs to wait 1 second each time for DNS resolution. No matter how you optimize the underlying logic of the application, the response of this interface is always too slow for users, because the response time will always be greater than 1 second.

Therefore, in the process of application development, we must consider the potential performance problems caused by DNS resolution and master common optimization methods. Here, I summarize several common DNS optimization methods.

  • Cache the results of DNS resolution. Caching is the most effective method, but it is important to note that once the cache expires, you still need to retrieve the new records from the DNS server. However, this is acceptable for most applications.

  • Prefetch the results of DNS resolution. This is the most commonly used method in browsers and other web applications. It means that the browser will automatically resolve the domain names in the background and cache the results without waiting for the user to click on the hyperlinks on the page.

  • Use HTTPDNS instead of conventional DNS resolution. This is a method that many mobile applications choose, especially considering the prevalence of domain hijacking. By bypassing the DNS servers in the network with the HTTP protocol, you can avoid the issue of domain hijacking.

  • Global Server Load Balancing (GSLB) based on DNS. This not only provides load balancing and high availability for services, but also returns the IP address closest to the user’s location.
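To illustrate the first method, here is a minimal sketch of a cache with expiry, written as a shell function that stores each result in a file together with the time it was fetched. Everything here is an assumption for illustration: the name cached_resolve, the file-per-name layout, and the fixed TTL passed by the caller (a real cache would normally honour the TTL carried in the DNS record).

```shell
# A tiny file-based cache with expiry around any resolver command.
# cached_resolve and the one-file-per-name layout are hypothetical choices.
cache_dir=$(mktemp -d)

# Usage: cached_resolve NAME TTL_SECONDS RESOLVER_COMMAND [ARGS...]
cached_resolve() {
    name=$1
    ttl=$2
    shift 2
    entry="$cache_dir/$name"
    now=$(date +%s)
    if [ -f "$entry" ]; then
        read -r stamp value < "$entry"
        if [ $((now - stamp)) -lt "$ttl" ]; then
            echo "$value"           # still fresh: answer from the cache
            return 0
        fi
    fi
    # Missing or expired: ask the resolver; only the first answer line is kept.
    value=$("$@" | head -n 1)
    [ -n "$value" ] || return 1
    echo "$now $value" > "$entry"
    echo "$value"
}

# "echo" stands in for a real resolver such as: dig +short time.geekbang.org
cached_resolve time.geekbang.org 60 echo 39.106.233.176   # resolver is called
cached_resolve time.geekbang.org 60 echo 39.106.233.176   # served from cache
```

Both calls print the same address, but only the first one runs the resolver command. Once the entry is older than the TTL, the next call goes back to the resolver, which is the expiry behaviour described in the first bullet above.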

Reflection #

Lastly, I would like to discuss with you the DNS issues you have encountered. What types of DNS issues have you come across? How did you troubleshoot them, and what methods did you use to resolve them? You can summarize your thought process while incorporating the knowledge you gained today.

Feel free to engage in a discussion with me in the comments section, and also feel free to share this article with your colleagues and friends. Let’s practice in real-world scenarios and improve through communication.