
The IT Detective Agency: the case of the slow DNS server responses to TCP

Intro

This case was solved today. Now I just need to find the time to write it up!

I belong to a team which runs many dozens of DNS servers. We have basic but thorough monitoring of these servers using both Zabbix and ThousandEyes. One day I noticed a lot of timeout alerts, so I began to look into it. One mystery just led to another without bringing us any closer to a true root cause. There were many dead ends in the hunt. Finally our vendor came through and discovered something…

The details

The upshot is this set of settings we arrived at for an ISC BIND server:

   tcp-listen-queue 200;
   tcp-clients 600;
   tcp-idle-timeout 10;

These go in the options section of the named.conf file. That’s it! This is on a four-core server with 16 GB RAM. The default values are:

tcp-listen-queue: 10

tcp-clients: 10

tcp-idle-timeout: 60 seconds

Those defaults will kill you on any reasonably busy server, meaning one that gets a couple thousand requests per second.
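If you want to see for yourself whether a server is slow to answer over TCP, a small probe helps. What follows is only a minimal sketch in Python, not the tooling we actually used (that was Zabbix and ThousandEyes). It builds a bare-bones DNS query, adds the two-byte length prefix that DNS over TCP requires, and times the TCP connect separately from the full query. The function names are my own invention for illustration, and any server address you pass in is up to you.

```python
import socket
import struct
import time


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query message with a single question."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name
    labels = [label for label in name.rstrip(".").split(".") if label]
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in labels
    ) + b"\x00"
    # QTYPE (1 = A record) and QCLASS (1 = IN)
    return header + qname + struct.pack("!HH", qtype, 1)


def frame_for_tcp(message: bytes) -> bytes:
    """Prefix the message with its length as a 2-byte big-endian integer."""
    return struct.pack("!H", len(message)) + message


def recv_exact(sock: socket.socket, count: int) -> bytes:
    """Read exactly `count` bytes or raise if the peer closes early."""
    buf = b""
    while len(buf) < count:
        chunk = sock.recv(count - len(buf))
        if not chunk:
            raise ConnectionError("server closed the connection early")
        buf += chunk
    return buf


def time_tcp_query(server: str, name: str, port: int = 53,
                   timeout: float = 5.0) -> tuple[float, float]:
    """Return (connect_seconds, total_seconds) for one DNS query over TCP."""
    start = time.monotonic()
    with socket.create_connection((server, port), timeout=timeout) as sock:
        connected = time.monotonic()
        sock.sendall(frame_for_tcp(build_query(name)))
        # The response carries the same 2-byte length prefix
        (length,) = struct.unpack("!H", recv_exact(sock, 2))
        recv_exact(sock, length)
    return connected - start, time.monotonic() - start
```

Calling time_tcp_query("192.0.2.53", "example.com") (a placeholder documentation address, substitute your own server) returns the two timings, or raises a timeout error if the server never answers. Running it in a loop against a busy server and comparing the connect time with the total time gives a rough idea of whether connections are being held up before the server accepts them or after.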

To be continued…

Conclusion

We encountered a tough situation on our ISC BIND DNS servers. TCP queries, and only TCP queries, were responded to slowly at best or not at all. After many false starts we found the solution was setting three TCP parameters in the options section of the configuration file: tcp-listen-queue, tcp-clients and tcp-idle-timeout. We’d never had to mess with those parameters in literally decades of running ISC BIND. Yet we have incontrovertible proof that that is what was needed.

Case: closed!

References and related

A great and very detailed discussion of this type of TCP backlog issue on Red Hat systems is found here: https://access.redhat.com/solutions/30453