Categories
Admin

The IT Detective Agency: New AIX Server working really slowly

Intro
Don’t ask me why anyone would willingly run IBM AIX, but it happens. And when they do, watch out for network punishment. We dealt with such a case, unfortunately, and we ran into a serious, somewhat obscure network issue and figured out the solution (we think). Maybe someone else can learn from this painful experience. Or maybe we’ll completely forget what we ourselves have done two years from now and find ourselves stepping on the same rake.

The details

So this new AIX server was configured to run an very old application, WebMethods, that makes a lot of database connections as well as connections to external partners for document exchange.

This had been working fine on the old AIX server, but we switched to newer hardware. As much as possible the old configurations were used. Yet this new server just couldn’t keep up with the load. Its queue started building up, connections to the database climbed into the hundreds, and then it just seemed like it was doing nothing at all.

Someone lent me root access so I can join the debugging party. What, no bash shell! Not even a properly configured korn shell. And everything’s just a little different on AIX – nothing is quite how you are accustomed to it. But at least it has tcpdump. I guess they also have their own AIXish utility as well, but I never bothered with that. tcpdump seemed to work. So I quickly began to get a feel from what the application folks were saying about their transfers which weren’t going outbound, and only slowly going inbound. They used port 5443 on one of the interfaces, en5:

# tcpdump -i en5 -n port 5443

And, true, not much was going on.

This went on for a day and things were looking desperate – to the point where we decided to go back to the old hardware! But we never stopped thinking.

Check the traffic to the Oracle database:

# tcpdump -i en0 -n port 1521

No, not that much either.

Try to check system logs, but who knows where those are? The ones I found had absolutely nothing of interest.

Being a DNS guy, I decided to check for DNS traffic:

# tcpdump -i en0 -n udp

(everything else uses tcp so I could get away with this).

Now DNS turned out to be quite chatty – around a dozen entries per second. And a lot of repetition. And a lot of IPv6 queries, labelled as AAAA?. I didn’t like it.

And this jogged my memory. I remembered encountering these IPv6 queries and wanting to turn them off on the old AIX servers. But how to do that??

As in all things, Google (actually DuckDuckGo) is your friend. You modify the /etc/netsvc.conf file. You need an entry like this:

hosts = local4, bind4

To be continued…

Categories
DNS IT Operational Excellence Network Technologies Uncategorized

Google’s DNS Servers Rock!

Intro
DNS is the Domain name Service, the Internet service that converts IP addresses, e.g., 200.54.129.57 into mnemonic names like www.mysite.com.

I tried to run a cache-only DNS server for use by a proxy server. What I found is that certain sites were not accessible on a frequent basis. I think uol.com.br is one of the problem sites (need to check this). It may not mean much to a US audience, but it’s really popular in Brazil!

At some point I happened to learn that Google has a public DNS service. This is worth pondering. No one of any repute has offered a DNS service to that point. There are a host of concerns about security, especially DNS cache poisoning. They blazed a trail, and did it in a way only Google and very few other major infrastructure players could. Not only did they offer a DNS service, they put their DNS servers all over the Internet and created convenient anycast addresses for their servers.

I am no expert on anycast addresses. You can look it up on Wikipedia, however. The essence for my purposes is that with a single IP address you’re going to hit the closest server, network-wise. So no matter where you are some Google DNS server is not far away. Try it. The anycast addresses are 8.8.8.8 and 8.8.4.4. They don’t mind, really! You can ping them. Traceroute to them, whatever. From the Amazon cloud Northeast 8.8.8.8 responds to PINGs in 3.4 ms. That’s really low. Not so low as to make me think they are in the same data center (it is different companies after all), but not far away.

The gold standard for running a DNS service is BIND. I have been running it for many years now and I want to give the Internet Software Consortium their due for providing this wonderful application. Once I got wind of my DNS difficulties as mentioned above, I had to wonder why not everyone else was complaining? They had to be using something else. I ran a flat-out performance test. 5000 queries from an actual proxy log, fed straight to my BIND DNS server, and then to Google’s DNS server 8.8.8.8. I have to dig up the numbers, but Google’s won by quite a bit! This result was actually surprising because you’re always going off-site to the Google DNS server, whereas my server can build up its cache and is right on my network. From where I tested the Google server was about 11 ms away. So 5000 x 11 ms = 55 s. So there is a 55 s handicap from just network considerations alone! Yet it is faster. On the quickest of queries the local server is indeed faster, but what happens is that over the course of real life queries, you always get a few problematic ones which either time out or just seem to take a long time to get back a response. That’s what kills the traditional DNS server and where Google has (obviously) made some optimizations.

And, that’s not all! Google also deals in a more forgiving fashion with broken domain names. I used to get on my high horse and proclaim to others about how broken their DNS servers are – it’s no wonder I can’t resolve their names, which means, by the way, I also cannot get to their web site nor send them email!

It’s effectively like taking yourself off the Internet, or so I thought. Turns out in some cases that’s only true if you’ve constrained yourself to resolving names with BIND. You see, BIND enforces the rules. And I’m a believer in rules. The Internet has about 5,000 technical rules called RFCs. DNS is a topic of many of these rules. The Internet could only have expanded to the size it currently has because all the major players agreed to abide by those rules. What Google has done with their server, in effect, is to say, “Well, if you don’t follow the rules, we’re going to try to work with you anyways.”

Here’s a concrete example. appliedcoatings.org. I guess at some point they’ll actually fix their severely broken DNS, but at the time I write this, August 21, 2011, these comments are valid and their domain is severely broken. In fact, I was amazed that people weren’t jumping up and down screaming at them. I couldn’t even send an email to them. That’s akin to knocking yourself off the Internet, right? Ah, but it all depends on whose DNS servers you are using!

There used to be lots of good free DNS analyzers, like dnsreport.com. You can still find a few around. www.zonecheck.fr, for instance. It shows FAILURE. If it were better written it would show the real problem, which is a lame delegation. But we’re experts, and we don’t need such tools! We will do the queries ourselves and show the lame delegation. We start by learning who are the authoritative nameservers for .ca, the top-level domain used in Canada:

 dig ns ca

; <<>> DiG 9.7.1-P2 <<>> ns ca
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52928
;; flags: qr rd ra; QUERY: 1, ANSWER: 10, AUTHORITY: 0, ADDITIONAL: 1

;; QUESTION SECTION:
;ca.                            IN      NS

;; ANSWER SECTION:
ca.                     83585   IN      NS      a.ca-servers.ca.
ca.                     83585   IN      NS      c.ca-servers.ca.
ca.                     83585   IN      NS      e.ca-servers.ca.
ca.                     83585   IN      NS      f.ca-servers.ca.
ca.                     83585   IN      NS      j.ca-servers.ca.
ca.                     83585   IN      NS      k.ca-servers.ca.
ca.                     83585   IN      NS      l.ca-servers.ca.
ca.                     83585   IN      NS      m.ca-servers.ca.
ca.                     83585   IN      NS      z.ca-servers.ca.
ca.                     83585   IN      NS      sns-pb.isc.org.

;; ADDITIONAL SECTION:
a.ca-servers.ca.        83594   IN      A       192.228.27.11

Now we ask one of them about the nameservers for appliedcoatings.ca:

 dig ns appliedcoatings.ca @a.ca-servers.ca.

; <<>> DiG 9.7.1-P2 <<>> ns appliedcoatings.ca @a.ca-servers.ca.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 288
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 2, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;appliedcoatings.ca.            IN      NS

;; AUTHORITY SECTION:
appliedcoatings.ca.     86400   IN      NS      sp2.domainpeople.com.
appliedcoatings.ca.     86400   IN      NS      sp1.domainpeople.com.

So far everything's cool. Now, since the authoritative flag (AA) was not present in that response we re-ask that query, but now to one of the nameservers that's supposed to be authoritative for that domain:

dig ns appliedcoatings.ca @sp2.domainpeople.com.

; <<>> DiG 9.7.1-P2 <<>> ns appliedcoatings.ca @sp2.domainpeople.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 24373
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;appliedcoatings.ca.            IN      NS

;; ANSWER SECTION:
appliedcoatings.ca.     86400   IN      NS      ns1.domainpeople.com.
appliedcoatings.ca.     86400   IN      NS      ns2.domainpeople.com.

Oh, oh. That's not supposed to happen. We're getting back an entirely different set of nameservers. That's a lame delegation. The domain should be considered completely broken. I think even BIND might be forgiving up to this point. a BIND resolver does these types of quesires to get at the answer. At this point it says, "OK, this is strange, but not necessariily fatal. I will ask my subsequent queries to ns1.domainpeople.com and ns2.domainpeople.com since they are listed as being the nameservers of record.

So now let's get to something useful: looking up the mail exchanger record so we see how to deliver mail to this domain. BIND, which has been fastidiously following the rules, does it as follows:

dig mx appliedcoatings.ca @ns1.domainpeople.com.

; <<>> DiG 9.7.1-P2 <<>> mx appliedcoatings.ca @ns1.domainpeople.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 49996
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;appliedcoatings.ca.            IN      MX

;; Query time: 79 msec
;; SERVER: 204.174.223.72#53(204.174.223.72)
;; WHEN: Sun Aug 21 19:05:43 2011
;; MSG SIZE  rcvd: 36

That's not good. Status is REFUSED. But BIND can even forgive this slight. There is one more nameserver to try after all, right? Last chance query:

dig mx appliedcoatings.ca @ns2.domainpeople.com.

; <<>> DiG 9.7.1-P2 <<>> mx appliedcoatings.ca @ns2.domainpeople.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 44404
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;appliedcoatings.ca.            IN      MX

;; Query time: 72 msec
;; SERVER: 64.40.96.140#53(64.40.96.140)
;; WHEN: Sun Aug 21 19:07:34 2011
;; MSG SIZE  rcvd: 36

Status also REFUSED. Now we are really and truly dead. If you are using a BIND nameserver you have no way to send email to someone@appliedcoatings.ca. But not so with Google!

Of course I don't know how Google wrote their DNS server, but I do think that some of their infrastructure experts write it themselves rather than using open source programs. So with a Google nameserver you will get a response:

dig mx appliedcoatings.ca @8.8.8.8

; <<>> DiG 9.7.1-P2 <<>> mx appliedcoatings.ca @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6901
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;appliedcoatings.ca.            IN      MX

;; ANSWER SECTION:
appliedcoatings.ca.     82805   IN      MX      10 mail.appliedcoatings.ca.

;; Query time: 4 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Sun Aug 21 19:11:14 2011
;; MSG SIZE  rcvd: 57

and just to close the loop and make sure this is a valid host you would do this:

dig mail.appliedcoatings.ca @8.8.8.8

; <<>> DiG 9.7.1-P2 <<>> mail.appliedcoatings.ca @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35190
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;mail.appliedcoatings.ca.       IN      A

;; ANSWER SECTION:
mail.appliedcoatings.ca. 86400  IN      A       66.183.21.181

And we can go the next step and begin an SMTP conversation with that server to make sure it is really operating. After all, if they messed up DNS there's no telling what else they might have gotten wrong.

 telnet  66.183.21.181 25
Trying 66.183.21.181...
Connected to 66.183.21.181.
Escape character is '^]'.
220 mail.appliedcoatings.ca Microsoft ESMTP MAIL Service, Version: 6.0.3790.4675 ready at  Sun, 21 Aug 2011 16:22:04 -0700
HELO localhost
250 mail.appliedcoatings.ca Hello [50.17.188.196]
quit
221 2.0.0 mail.appliedcoatings.ca Service closing transmission channel
Connection closed by foreign host.

Yup. They've got an operating mail server at that IP.

So we can reverse engineer a bit what Google's DNS server must have done behind the scenes to arrive at a valid answer where BIND could not. I'm 100% sure that Google would have also done the query

dig mx appliedcoatings.ca @ns1.domainpeople.com

since that is the right thing to do. But not getting a satisfactory answer (status: REFUSED), what it must do additionally after getting refused a second time by ns2.domainpeople, is to go back to the originally named nameservers sp1 and sp2. Watch what happens in that case:

 dig mx appliedcoatings.ca @sp1.domainpeople.com.

; <<>> DiG 9.7.1-P2 <<>> mx appliedcoatings.ca @sp1.domainpeople.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10226
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;appliedcoatings.ca.            IN      MX

;; ANSWER SECTION:
appliedcoatings.ca.     86400   IN      MX      10 mail.appliedcoatings.ca.

;; AUTHORITY SECTION:
appliedcoatings.ca.     86400   IN      NS      ns1.domainpeople.com.
appliedcoatings.ca.     86400   IN      NS      ns2.domainpeople.com.

;; ADDITIONAL SECTION:
mail.appliedcoatings.ca. 86400  IN      A       66.183.21.181

The AA (authoritative) flag is set in the response. So it's a good response, but sent to the "wrong" nameserver. Nevertheless, it is a response and it gets anyone using that nameserver more functionality than someone using BIND.

Conclusion
So far we've got three advantages speaking favorably for Google's DNS server: it's faster, it's answers are more complete and it's universally available. Wait, there's more! Another nice thing is what it does not do. Some ISPs have a "feature" I call DNS clobbering. In fact it's so annoying I will devote a whole blog post to describing it in more detail. Essentially they take license with DNS and make up answers to some queries! It's true and it's truly annoying. Not all ISPs do this but mine certainly does. So the other nice thing about Google DNS is that it does not do DNS clobbering and it's available for you to use it at home and avoid this annoying feature. You just set your DNS servers rather than have them assigned automatically via DHCP.

Other Resources
I should mention that while researching public DNS servers I was also led to commercial versions of the same thing. I went so far as to test the timings on one of those services and found that it is more distant, round-trip-wise, than Google's anycast server. Stands to reason. Google's got the best Internet access of anyone. They're on all the major highways. The commercial offerings have some additional cool features, however. They can serve as URL filter. So if someone puts in a URL which leads to a malicious site, for example, they can respond with an answer that spares you from going to that infected site. This is a little more crude than URL filtering at the proxy level, since a DNS server has no knowledge of the URI whereas a proxy URL filter does, but it could be quite serviceable. I'm not sure it allows you to pick and choose URL categories to block as with a URL filter (gambling, porn, hacking sites, etc.).

A lot more information on using Google DNS is at http://code.google.com/speed/public-dns/docs/using.html.

September 1 Update - a Crack in the Infrastructure
I now have my first case of a domain name which Google DNS did not resolve correctly, and for no apparent reason. The domain name is forums.tweaktown.com. Here's proof of Google's failure, followed immediately by Amazon's DNS servers' success:

dig forums.tweaktown.com @8.8.8.8

; <<>> DiG 9.7.1-P2 <<>> forums.tweaktown.com @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 15826
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;forums.tweaktown.com.          IN      A

;; AUTHORITY SECTION:
tweaktown.com.          116     IN      SOA     ns21.domaincontrol.com. dns.jomax.net. 2011060602 28800 7200 604800 86400

;; Query time: 4 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Thu Sep  1 14:40:50 2011
;; MSG SIZE  rcvd: 106


 dig forums.tweaktown.com

; <<>> DiG 9.7.1-P2 <<>> forums.tweaktown.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52290
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 1

;; QUESTION SECTION:
;forums.tweaktown.com.          IN      A

;; ANSWER SECTION:
forums.tweaktown.com.   1885    IN      A       38.101.21.25

;; AUTHORITY SECTION:
tweaktown.com.          1943    IN      NS      ns22.domaincontrol.com.
tweaktown.com.          1943    IN      NS      ns21.domaincontrol.com.

;; ADDITIONAL SECTION:
ns21.domaincontrol.com. 753     IN      A       216.69.185.11

;; Query time: 0 msec
;; SERVER: 172.16.0.23#53(172.16.0.23)
;; WHEN: Thu Sep  1 14:40:55 2011
;; MSG SIZE  rcvd: 122

All BIND servers I tried during this time returned the correct answer.

Is this an isolated incident or a tip of an iceberg of problems? I hope it is a one-off. I'll post updates as I find out more. I am slightly concerned now.

References and related
I finally wrote my own web interface to DNS and published the code I did it with. Check it out here.

A web interface to Google's public DNS service, which will give you more debug information, is https://dns.google.com/