DNS Troubleshooting Collection

When I first learned about DNS, I thought it couldn't be that complicated. It's just some DNS records stored on a server. What's the big deal?

But textbooks only explain how DNS works; they don't tell you how many ways DNS can break your system in practice. And it isn't just caching problems!

So I asked on Twitter for DNS problems people had run into, especially ones that didn't seem to have anything to do with DNS at first (the "it's always DNS" meme).

I’m not going to discuss how to solve or avoid these problems in this post, but I will include some links to places where you can find solutions to the problems.

Problem: Slow network requests

Your network requests are slower than expected, and the cause turns out to be a slow DNS resolver. The resolver might be overloaded, have a memory leak, or be slow for some other reason.

I had this problem with my router's DNS forwarder, which made all my DNS requests very slow. I fixed it by restarting my router.
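
One rough way to check whether the DNS step is the slow part is to time a lookup on its own. Here is a minimal Python sketch. The helper name time_lookup and its resolve parameter are my own inventions for illustration; by default it calls the system resolver through socket.getaddrinfo.

```python
import socket
import time


def time_lookup(hostname, resolve=socket.getaddrinfo):
    """Return (seconds_elapsed, error) for a single name lookup.

    `resolve` defaults to the system resolver via socket.getaddrinfo.
    It is a parameter only so the timing logic can be tried offline.
    """
    start = time.perf_counter()
    error = None
    try:
        resolve(hostname, None)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        error = exc
    return time.perf_counter() - start, error
```

Calling time_lookup("example.com") a few times gives a feel for how long the resolver takes. Keep in mind that a local cache can make repeat lookups look much faster than the first one.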

Problem: DNS timeout

A few people mentioned network requests that took more than 2 seconds, or even 30 seconds, because of DNS query timeouts. This is similar to the "slow network requests" problem, but worse, because a single DNS request can eat up several seconds.

Sophie Haskins wrote a blog post about Kubernetes DNS timeouts: "A Kube DNS pitfall experience".

Problem: ndots settings

A few people mentioned problems caused by setting ndots:5 in /etc/resolv.conf.

Here is an example /etc/resolv.conf, taken from the article "Why setting ndots:5 in /etc/resolv.conf in Kubernetes container pods will slow down your application performance":

 nameserver 100.64.0.10
 search namespace.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
 options ndots:5

If you use the configuration above and look up google.com, your program calls the getaddrinfo function, which queries the following domain names in order:

  1. google.com.namespace.svc.cluster.local.
  2. google.com.svc.cluster.local.
  3. google.com.cluster.local.
  4. google.com.eu-west-1.compute.internal.
  5. google.com.

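Basically, it checks whether google.com is a subdomain of each entry on the search line before trying it as a name in its own right. This happens because google.com contains fewer than 5 dots.

So every time you look up a name like this, you have to wait for the first four queries to fail before you get the real answer.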
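
The order of those queries can be sketched in a few lines of Python. This is a simplified model of how the search list and ndots interact, not the actual libc code, and the function name candidate_names is made up for this example.

```python
def candidate_names(name, search, ndots):
    """List the fully qualified names a stub resolver would try, in order.

    Simplified model: a name ending in "." is treated as absolute, a name
    with at least `ndots` dots is tried as-is first, and anything else goes
    through the search list before being tried as-is.
    """
    if name.endswith("."):
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


SEARCH = [
    "namespace.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "eu-west-1.compute.internal",
]
```

With that search list, candidate_names("google.com", SEARCH, 5) produces the five names listed above, while candidate_names("google.com.", SEARCH, 5) produces only google.com. itself, because the trailing dot marks the name as absolute.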

Problem: Difficulty determining the DNS resolver used by the system

This isn't a problem in itself, but when you have DNS problems, it's usually something to do with the DNS resolver. I don't have a one-size-fits-all method for determining the DNS resolver.

Here are the methods I know of:

  • On Linux systems, the DNS resolver is most commonly chosen through /etc/resolv.conf. There are exceptions, though: a browser, for example, may ignore /etc/resolv.conf and use DNS over HTTPS instead.
  • If you are using UDP DNS, you can see where the DNS requests are being sent with sudo tcpdump port 53. But if you are using DNS over HTTPS or DNS over TLS, this method will not work.
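
For the /etc/resolv.conf route, here is a small Python sketch that pulls out the configured nameservers. The function name parse_resolv_conf is made up, and it only reports what the file says; as noted above, a program may be using a different resolver entirely.

```python
def parse_resolv_conf(text):
    """Extract nameserver, search and options entries from resolv.conf text."""
    config = {"nameserver": [], "search": [], "options": []}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Lines starting with '#' or ';' are comments.
        if not line or line[0] in "#;":
            continue
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            config["nameserver"].append(values[0])
        elif keyword == "search":
            config["search"] = values  # a later search line replaces an earlier one
        elif keyword == "options":
            config["options"].extend(values)
    return config
```

To try it on a real machine, pass it the file contents, for example parse_resolv_conf(open("/etc/resolv.conf").read())["nameserver"].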

I vaguely remember this being more confusing on macOS, and I'm not sure why.

Problem: DNS server returns NXDOMAIN instead of NOERROR

This is a problem I once encountered where Nginx could not resolve the domain name.

  • I configured Nginx to use a specific DNS server for its DNS queries.
  • When handling a request for this domain, Nginx made two queries: one for the A record and one for the AAAA record.
  • For the A query, the DNS server returned NXDOMAIN.
  • Nginx concluded that the domain did not exist and gave up.
  • The DNS server returned a successful response for the AAAA query.
  • But Nginx ignored the AAAA response, because it had already given up.

The problem was that the DNS server was supposed to return NOERROR - the domain existed, it just didn't have an A record for it. I reported the problem and they fixed it.

Having written this same bug myself, I understand how it happens - it's easy to assume that "there is no record to return, so the response code should be NXDOMAIN".
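
The distinction can be sketched in Python. This is a toy model with a made-up zone layout (a dict mapping names to their records by type), and it ignores wildcards, CNAMEs and other details that a real DNS server has to handle.

```python
NOERROR = "NOERROR"
NXDOMAIN = "NXDOMAIN"


def answer(zone, name, rtype):
    """Return (rcode, records) for a query against a toy in-memory zone."""
    records_by_type = zone.get(name)
    if records_by_type is None:
        # The name has no records of any type: it does not exist.
        return NXDOMAIN, []
    # The name exists. If it has no records of this type, the answer is
    # still NOERROR, just with an empty record list.
    return NOERROR, list(records_by_type.get(rtype, []))


# Hypothetical zone: the name exists but only has an AAAA record.
ZONE = {"example.test.": {"AAAA": ["2001:db8::1"]}}
```

Here answer(ZONE, "example.test.", "A") returns NOERROR with no records, which tells the client to keep going with the AAAA answer. NXDOMAIN is reserved for names like missing.test. that are not in the zone at all.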

Problem: Negative DNS caching

If you visit a domain before its DNS record has been created, the absence of the record gets cached. This can be quite surprising the first time you run into it - I only learned about it last year.

How long the absence is cached comes from the domain's Start of Authority (SOA) record: per RFC 2308, it is the smaller of the SOA record's TTL and its MINIMUM field. For jvns.ca, for example, this value is one hour.
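
To make the behaviour concrete, here is a tiny negative cache sketched in Python. It follows the RFC 2308 rule of taking the smaller of the SOA record's TTL and the SOA MINIMUM field. The class and method names are made up for this example, and real resolvers may also cap the value.

```python
import time


def negative_cache_ttl(soa_ttl, soa_minimum):
    """Seconds a 'no such record' answer may be cached (RFC 2308 rule)."""
    return min(soa_ttl, soa_minimum)


class NegativeCache:
    """Remembers which (name, type) pairs were recently found to be missing."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expires = {}

    def remember_missing(self, name, rtype, soa_ttl, soa_minimum):
        ttl = negative_cache_ttl(soa_ttl, soa_minimum)
        self._expires[(name, rtype)] = self._clock() + ttl

    def is_known_missing(self, name, rtype):
        expiry = self._expires.get((name, rtype))
        return expiry is not None and self._clock() < expiry
```

With a one-hour value, a record created one minute after the failed lookup stays invisible to that resolver for another 59 minutes.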

Problem: Nginx caches DNS records forever

If you use the following configuration in Nginx:

 location / {
     proxy_pass https://some.domain.com;
 }

Nginx will resolve some.domain.com only once, at startup, and will never resolve it again. This is quite dangerous, especially for domains whose IP addresses change frequently. Everything may run smoothly for months, and then one night it gets you out of bed at 2 am.

There are many well-known solutions to this problem, but since this article is not about Nginx, I am not going to go into it. But it will definitely surprise you the first time you encounter it.

Here is a blog post about this issue happening with AWS load balancers.

Problem: Java caches DNS records forever

This is the same kind of problem, but in Java, and apparently it depends on how your JVM is configured. "The default TTL setting of the JVM may cause the DNS record to be refreshed only when the JVM is restarted."

I haven't run into this myself, but friends who write a lot of Java have.

Of course, any software can have issues with permanently caching DNS, but I've heard it often happens with Nginx and Java.
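
The difference between resolving once at startup and re-resolving after a TTL can be sketched in Python. This only illustrates the failure mode; it is not how Nginx or the JVM is implemented, and all the names here are made up.

```python
import time


class ResolveOnce:
    """Looks the name up a single time, when the object is created."""

    def __init__(self, hostname, resolve):
        self._address = resolve(hostname)

    def address(self):
        return self._address  # stale forever if the DNS record changes


class ReResolving:
    """Looks the name up again once the cached answer is older than the TTL."""

    def __init__(self, hostname, resolve, ttl_seconds, clock=time.monotonic):
        self._hostname = hostname
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._clock = clock
        self._address = None
        self._expires = 0.0

    def address(self):
        now = self._clock()
        if self._address is None or now >= self._expires:
            self._address = self._resolve(self._hostname)
            self._expires = now + self._ttl
        return self._address
```

If the record behind the hostname changes, ResolveOnce keeps returning the old address until the process restarts, while ReResolving picks up the new one after at most ttl_seconds.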

Problem: Forgotten /etc/hosts records

This is another cache-like problem: entries in /etc/hosts override your normal DNS settings!

What's confusing is that the dig command ignores the /etc/hosts file. So when you use dig whatever.com to query DNS information, it will tell you that everything is fine.
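
Since dig won't warn you, a quick check of the hosts file itself can help. Here is a Python sketch; the function name hosts_overrides is made up, and it takes the file contents as a string so it can be tried on any text.

```python
def hosts_overrides(hosts_text, hostname):
    """Return the addresses that a hosts file assigns to `hostname`."""
    wanted = hostname.lower().rstrip(".")
    matches = []
    for raw_line in hosts_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        address, *names = line.split()
        if wanted in (name.lower() for name in names):
            matches.append(address)
    return matches
```

For example, hosts_overrides(open("/etc/hosts").read(), "whatever.com") returns an empty list when there is no override, and the overriding address or addresses when there is one.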

Problem: Email not sent/going to spam

Email delivery and authentication both depend on DNS records (MX, SPF, and DKIM records), so some email problems are actually DNS problems.

Problem: Internationalized domain names don't work

You can register domain names that use non-ASCII characters or even emoji, such as https://💩.la.

DNS can handle internationalized domain names because 💩.la is encoded with punycode and converted to xn--ls8h.la.
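
You can reproduce that conversion with Python's built-in punycode codec. This is a minimal sketch: the function name to_ascii_hostname is made up, and it skips the normalization and validation steps that the full IDNA standards require.

```python
def to_ascii_hostname(hostname):
    """Convert each non-ASCII label of a hostname to its xn-- form."""
    labels = []
    for label in hostname.split("."):
        if label.isascii():
            labels.append(label)
        else:
            # Raw punycode encoding only, with no IDNA normalization.
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)
```

Calling to_ascii_hostname("💩.la") gives xn--ls8h.la, which is the name that actually goes into the DNS query.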

Although there are standards for handling internationalized domain names in DNS, a lot of software does not handle them well. Julian Squires's "Get rid of emojis in Chrome" is a very interesting example.

Problem: TCP DNS is blocked by the firewall

Someone mentioned that some firewalls allow UDP on port 53 but block TCP. However, some DNS queries need TCP on port 53 (for example, when the response is too large for UDP), so this can cause intermittent problems that are difficult to troubleshoot.

Problem: musl does not support TCP DNS

Many applications use libc's getaddrinfo to make DNS queries. musl is an alternative to glibc that is used in Alpine Docker containers, and at the time of writing it did not support TCP DNS (TCP fallback was added later, in musl 1.2.4). If the response to your DNS query is larger than the classic UDP DNS limit of 512 bytes, you will run into problems.

I'm still not sure about this, and my understanding below may be wrong:

  1. musl's getaddrinfo sends a DNS request.
  2. The DNS server finds that the response is too large to fit in a single UDP DNS packet.
  3. The DNS server returns an empty truncated response and expects the client to retry the query over TCP.
  4. But musl does not support TCP DNS, so it never retries.
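
The "truncated response" in step 3 is signalled by the TC bit in the DNS message header. Here is a Python sketch of how a client could detect it. The function names are made up, and the sketch only inspects the 12-byte header, not the rest of the message.

```python
import struct

TC_FLAG = 0x0200  # "truncated" bit in the 16-bit flags field of the header


def is_truncated(message):
    """Return True if a raw DNS message has the TC (truncated) bit set."""
    if len(message) < 12:
        raise ValueError("DNS message is shorter than the 12-byte header")
    _query_id, flags = struct.unpack("!HH", message[:4])
    return bool(flags & TC_FLAG)


def needs_tcp_retry(udp_response):
    """A client that supports TCP would repeat the query over TCP here."""
    return is_truncated(udp_response)
```

A resolver with TCP support checks this bit on every UDP response and re-sends the same question over TCP when it is set. A resolver without TCP support is left with the empty truncated answer.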

Article about this problem: DNS resolution issues on Alpine Linux.

Problem: getaddrinfo does not support round-robin DNS

Round-robin DNS is a load balancing technique where each DNS query gets a different IP address first. Apparently this works if you use gethostbyname to make DNS queries, but not if you use getaddrinfo, because getaddrinfo sorts the IP addresses it gets back.

So when you switch from gethostbyname to getaddrinfo, you may not realize at all that you could be breaking your load balancing.

This problem can be very subtle. If you are not programming in C, these function calls are hidden behind various libraries, and you may not be aware of the change at all. A seemingly harmless upgrade could cause your DNS load balancing to stop working.
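
Here is a toy Python model of why sorting defeats round-robin DNS. The real getaddrinfo orders addresses by a more involved set of rules (RFC 6724), so the plain sorted() below is only a stand-in for "some deterministic ordering". The addresses and function names are made up for this example.

```python
def rotated_answers(addresses, queries):
    """Simulate a round-robin DNS server: each answer starts one address later."""
    count = len(addresses)
    return [
        addresses[i % count:] + addresses[:i % count]
        for i in range(queries)
    ]


def first_choice_as_received(answers):
    """A client that connects to the first address in the order received."""
    return [answer[0] for answer in answers]


def first_choice_after_sorting(answers):
    """A client whose resolver library sorts the addresses before returning them."""
    return [sorted(answer)[0] for answer in answers]


BACKENDS = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
```

Over three queries, the first client connects to each of the three backends once, while the second connects to 192.0.2.1 every time, so all the load lands on one server.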

Here are some articles discussing this:

  • getaddrinfo causes round-robin DNS to fail
  • getaddrinfo, round-robin DNS and happy eyeballs algorithm

Problem: Race condition when starting the service

Someone mentioned a problem with Kubernetes DNS: they had two containers that started at the same time and immediately tried to resolve each other's address. Since Kubernetes DNS had not been updated yet, the DNS queries failed. The failure was then cached, so subsequent queries kept failing.

Final Thoughts

I've only listed the tip of the DNS iceberg, and I'm looking forward to hearing about other issues and links that I haven't mentioned. I'd like to know how these issues actually occur and how they can be resolved.
