The IT Detective Agency: The mystery of the bad TE DNS server test

Intro

One of my colleagues, someone who, unlike me, does not moonlight as a DNS team member, set up what should be a simple DNS test in ThousandEyes Network and App Synthetics. This was in response to a customer complaint that resolution of a particular FQDN occasionally failed. Well, the test results didn’t look right. They indicated very spotty resolution of the FQDN.

I’ve come to take TE results with a grain of salt, as though each result needs an asterisk next to it. You train yourself to ignore most of the noise and look mainly for secular trends, in the hope that they might be meaningful.

The details

This picture sums it up.

A DNS test in ThousandEyes

It’s not looking too good, right? But come on, this is DNS – about as simple as it gets. Either it’s working or something is seriously wrong.

Well, I have access to both ends of the conversation so I do what any network engineer would do and I start to take packet traces on both the source – agent in TE-speak – and destination – one of the DNS servers being tested and showing an error.

For the record, the particular error is

RCODE:0 – No resource records of type A found
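One plausible reading of that message is that the server replied with RCODE 0 (NOERROR) but with an empty answer section. I did not capture the exact bytes, so the sample below is hand-built and purely illustrative. It is a minimal sketch of how you could decode the fixed 12-byte DNS header from a packet trace to spot that combination; the function names are my own.

```python
import struct


def summarize_dns_header(message: bytes) -> dict:
    """Decode the fixed 12-byte DNS header (RFC 1035, section 4.1.1)."""
    if len(message) < 12:
        raise ValueError("DNS message is shorter than the 12-byte header")
    ident, flags, qdcount, ancount, nscount, arcount = struct.unpack(
        "!6H", message[:12]
    )
    return {
        "id": ident,
        "is_response": bool(flags & 0x8000),  # QR bit
        "rd": bool(flags & 0x0100),           # recursion desired
        "ra": bool(flags & 0x0080),           # recursion available
        "rcode": flags & 0x000F,              # 0 means NOERROR
        "answers": ancount,
    }


def looks_like_empty_noerror(message: bytes) -> bool:
    """True for a response with RCODE 0 and no answer records."""
    header = summarize_dns_header(message)
    return header["is_response"] and header["rcode"] == 0 and header["answers"] == 0


# Hand-built sample header (an assumption, not a captured packet):
# response, RD clear, RA set, RCODE 0, one question, zero answers.
sample = struct.pack("!6H", 0x1234, 0x8080, 1, 0, 0, 0)
print(summarize_dns_header(sample))
print(looks_like_empty_noerror(sample))
```

Note the RD bit in that sample is clear, which turns out to matter later in the story.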

Funny thing. After I started taking my traces, I noticed the communication began to work in TE. So I switch to another broken DNS server and set up and then start the traces. And then it too starts working in TE. Finally, I run out of broken DNS servers and all of them are now working! What the…

A little misdirection

So I opened a support case mentioning this strange phenomenon. You never know with support. But TE has made it exceedingly easy to create cases – you never have to leave the app which is super convenient. I marked it with the lowest possible priority since it was mostly a matter of curiosity at this point.

However, within a few hours I got a response. Those tests weren’t sending recursive queries. You can turn on recursive queries in the Advanced section.

So what was going on? I fooled you as I fooled myself with a little misdirection, and I failed to tell you all the relevant facts. But how can you when you don’t know them yourself? In fact I threw some additional red herrings into the ticket. I stated that this FQDN is an internal domain name and not resolvable on the Internet. I felt that that might have something to do with it. Wrong. Whenever I started those traces I also ran a dig by hand to make sure the DNS server really could resolve that domain name. And it always could. Then I would fire up the packet traces. Well, dig has recursive queries enabled by default. In fact, and I’ve never used this until today, you can only turn it off by adding the +norecurse switch! Who knew?
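On the wire, the difference between the two kinds of query is a single bit in the header: RD (recursion desired). Here is a minimal sketch that builds an A-record query both ways using only the standard library. The hostname is a made-up placeholder, not the FQDN from the case, and build_a_query is my own helper name.

```python
import struct


def build_a_query(name: str, recursion_desired: bool, ident: int = 0x1234) -> bytes:
    """Build a DNS query for an A record, with or without the RD bit."""
    flags = 0x0100 if recursion_desired else 0x0000  # RD is the only flag set
    header = struct.pack("!6H", ident, flags, 1, 0, 0, 0)  # one question
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!2H", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


# Hypothetical internal name, for illustration only.
recursive = build_a_query("host.example.internal", recursion_desired=True)
non_recursive = build_a_query("host.example.internal", recursion_desired=False)

print(recursive[2:4].hex())      # flags with RD set
print(non_recursive[2:4].hex())  # flags with RD clear
```

The first query is the kind dig sends by default; the second is what you get with +norecurse, and what the TE test was sending before the setting was changed.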

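So I was filling that DNS server’s cache with the answer by sending a single recursive query, and the record had a fairly long TTL, about six hours. While the answer sat in the cache, the server could answer TE’s non-recursive queries. Once the TTL expired, it would revert to the original bad behavior.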
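To make the cache effect concrete, here is a toy model of a recursive server. It is a deliberate simplification and not how any particular DNS server is implemented: real servers may return a referral or behave differently for non-recursive queries on uncached names. The class name, the placeholder hostname and the address are all made up; the six-hour TTL matches the story.

```python
class ToyRecursiveServer:
    """Toy model: answers from cache, and only recurses when asked to."""

    def __init__(self, upstream):
        self.upstream = upstream  # name -> (address, ttl_seconds)
        self.cache = {}           # name -> (address, expires_at)

    def query(self, name, recursion_desired, now):
        entry = self.cache.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]       # still cached: answered either way
        if not recursion_desired:
            return None           # modelled as NOERROR with no answers
        address, ttl = self.upstream[name]
        self.cache[name] = (address, now + ttl)  # recursive query fills the cache
        return address


SIX_HOURS = 6 * 3600
server = ToyRecursiveServer({"host.example.internal": ("10.0.0.1", SIX_HOURS)})

# Times are seconds since an arbitrary start.
before = server.query("host.example.internal", recursion_desired=False, now=0)
by_hand = server.query("host.example.internal", recursion_desired=True, now=10)
during = server.query("host.example.internal", recursion_desired=False, now=20)
after = server.query(
    "host.example.internal", recursion_desired=False, now=10 + SIX_HOURS
)
print(before, by_hand, during, after)
```

In this model the non-recursive test fails, a single recursive lookup by hand "fixes" it, and it fails again once the cached entry expires.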

See how, in the picture above, availability has climbed to 100% in the last measurement, all the way to the right? That is where I changed the test to send recursive queries, and all is good now.

Case: closed.

Conclusion

Although I still feel ThousandEyes produces many test results with artifacts which have to be discounted, a simple DNS resolution test can probably be trusted, as long as you know what you are doing. When you are testing a recursive server, check Send recursive queries in the advanced properties of the test.

References and related

More info concerning ThousandEyes: https://www.thousandeyes.com/
