Categories
Admin Network Technologies

The IT Detective Agency: virus updates are failing

Intro
This case hardly qualifies as worthy subject matter for the IT Detective Agency – it’s pretty run-of-the-mill stuff. But I wanted to document it for completeness and show how a problem in one thing can turn out to have an unexpected cause (At least to me. In hindsight it’s dead obvious what the issue was likely to have been).

The Situation
We have lots of servers at drjohns. So when one of our admins, Shake, said that one of them, nfuz01, can’t reach the Etrust serverto get its virus updates I had no recollection of what that server is or does. Shake asked if the firewall had changed recently. That’s sort of a tricky question because there are always minor changes being done. Most have absolutely no effect because they are additional rules providing new permissions. So I bravely answered No, it hadn’t. And I wondered what he meant in using the word “reach” anyways.

So I walk up to Shake’s desk to get a better idea. He said not only are updates not virus signature updates not occurring, but neither server can PING the other, neither by name nor by IP address. Now we’re getting somewhere. I still haven’t registered where nfuz01 is, but I know the firewall as I’ve set it up permits ICMP traffic to transit. I suggested that maybe nfuz01 had some missing or messed-up routes. Then I went back to my desk to think some more. That’s what gets me motivated – when I’ve publicly speculated about the root cause of something. It’s not so much that I may be proven wrong, but if I am wrong, I want to be the first to find out and issue a correction.

So I tried a PING from my desktop:

C:\>ping 10.91.12.14
 
Pinging 10.91.12.14 with 32 bytes of data:
Reply from 171.18.252.10: TTL expired in transit.
Reply from 171.18.252.10: TTL expired in transit.

I look up where nfuz01 is. It is in a secondary data center. I ping it from a server in that same data center, but one a different segment – it works fine! I ping it from a Linux server in my main data center – totally different results:

> ping 10.91.12.14
PING 10.91.12.14 (10.91.12.14) 56(84) bytes of data.
From 171.18.252.10 icmp_seq=1 Time to live exceeded
From 171.18.252.10 icmp_seq=2 Time to live exceeded
 
--- 10.91.12.14 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1000ms
 
> traceroute -n 10.91.12.14
traceroute to 10.91.12.14 (10.91.12.14), 30 hops max, 40 byte packets
 1  10.136.188.2  1.100 ms  1.523 ms  2.010 ms
 2  10.1.4.2  0.934 ms  0.941 ms  0.773 ms
 3  10.1.4.141  0.869 ms  0.926 ms  0.913 ms
 4  171.18.252.10  1.076 ms  1.096 ms  1.130 ms
 5  10.1.4.129  1.043 ms  1.029 ms  1.018 ms
 6  10.1.4.141  0.993 ms  0.611 ms  0.918 ms
 7  171.18.252.10  0.932 ms  0.916 ms  1.002 ms
 8  10.1.4.129  0.987 ms  1.089 ms  1.121 ms
 9  10.1.4.141  1.152 ms  1.246 ms  1.229 ms
10  171.18.252.10  2.040 ms  2.747 ms  2.735 ms
11  10.1.4.129  1.332 ms  1.418 ms  1.467 ms
12  10.1.4.141  1.477 ms  1.754 ms  1.685 ms
13  171.18.252.10  1.993 ms  1.978 ms  2.013 ms
14  10.1.4.129  1.930 ms  1.960 ms  2.039 ms
15  10.1.4.141  2.065 ms  2.156 ms  2.140 ms
16  171.18.252.10  2.116 ms  5.454 ms  5.453 ms
17  10.1.4.129  4.466 ms  4.385 ms  4.296 ms
18  10.1.4.141  4.266 ms  4.267 ms  4.260 ms
19  171.18.252.10  4.232 ms  4.216 ms  4.216 ms
20  10.1.4.129  4.182 ms  4.063 ms  4.009 ms
21  10.1.4.141  3.994 ms  3.987 ms  2.398 ms
22  171.18.252.10  2.400 ms  2.484 ms  2.690 ms
23  10.1.4.129  2.346 ms  2.449 ms  2.544 ms
24  10.1.4.141  2.534 ms  2.607 ms  2.610 ms
25  171.18.252.10  2.602 ms  2.742 ms  2.736 ms
26  10.1.4.129  2.776 ms  2.856 ms  2.848 ms
27  10.1.4.141  2.648 ms  3.185 ms  3.291 ms
28  171.18.252.10  3.236 ms  3.223 ms  3.235 ms
29  10.1.4.129  3.219 ms  3.277 ms  3.377 ms
30  10.1.4.141  3.363 ms  3.381 ms  3.449 ms

Cool, right? We’ve caught a network loop in the act. Now I know it isn’t the firewall, it isn’t the routes on nfuz01 but it is something with networking. So I sent that off to them….

In less than an hour I got the explanation as well as the fix:


All should be reachable again. There’s a loop I can’t clear amongst some [telecom-owned] routers in the main data center. I’ve superseded it with two /27s until they clear it.

And it pings fine now:

> ping 10.91.12.14
PING 10.91.12.14 (10.91.12.14) 56(84) bytes of data.
64 bytes from 10.91.12.14: icmp_seq=1 ttl=125 time=46.2 ms
64 bytes from 10.91.12.14: icmp_seq=2 ttl=125 time=40.1 ms
64 bytes from 10.91.12.14: icmp_seq=3 ttl=125 time=23.1 ms
 
--- 10.91.12.14 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 23.114/36.502/46.275/9.796 ms

And Shake says the updates came in.

Case closed!

Conclusion
Why wasn’t the problem more obvious to us from the very beginning? Well, if the admin who said nfuz01 couldn’t reach Etrust had tried to log in nfuz01 through the normal Remote Desktop mechanism – and of course failed – then we might have drilled down into a networking cause more quickly. But nfuz01 is a VM and he must have been logged on via VMWare Virtual Center and so he hadn’t noticed that basically the server couldn’t reach anywhere in our main data center. It is also an obscure server (remember that I had no recall about it?) so no one really noticed that it was effectivley out-of-business.

Categories
Admin IT Operational Excellence Network Technologies

The IT Detective Agency: the case of the Adobe form network issue

Intro
Sometimes IT is called in to fix things we know little or nothing about. We may fix it, still not know anything, except what we did to fix it, and move on. Such is the case here with a mysterious Adobe Form that wasn’t working when I was called in to a meeting to discuss it.

The Case
One of our developers created a simple Adobe Acrobat form document. It has some logic to ask for a username and password and then verify them against an LDAP server using a network connection. Or at least that was the theory. It worked fine and then we got new PCs running Windows 7 and it stopped working. They asked me for help.

I asked to get a copy of the form because I like to test on my desktop where I am free to try dumb things and not waste others’ time. Initially I thought I also had the error. They showed me how to turn on Javascript debugging in edit|preferences. The debug window popped up with something like this:

debug 5 : function setConstants
debug 5 : data.gURL = https://b2bqual.drjohnstechtalk.com/invoke/Drjohns_Services.utilities:httpPostToBackEnd
debug 5 : function today_ymd
debug 5 : mrp::initialize: version 0.0001 debug level = 5
debug 5 : Login clicked
debug 5 : calling LDAPQ
debug 5 : in LDAPQ
 
NotAllowedError: Security settings prevent access to this property or method.
SOAP.request:155:XFA:data[0]:mrp[0]:sub1[0]:btnLogin[0]:clic

But this wasn’t the real problem. In this form you get a yellow bar at the top along with this message and you give approval to the form to access what it needs. Then you run it again.

For me, then, it worked. I knew this because it auto-populated some user information fields after taking a few seconds.

So i worked with a couple people for whom it wasn’t working. One had Automatically detect proxy settings checked. Unfortunately the new PCs came this way and it’s really not what we want. We prefer to provide a specific PAC file. With the auto-detect unchecked it worked for this guy.

The next guy said he already had that unchecked. I looked at his settings and confirmed that was the case. However, in his case he mentioned that Firefox is his default browser. He decided to change it back to Internet Explorer. Then he tested and lo and behold. It began to work for him as well!

When it wasn’t working he was seeing an error:

NetworkError: A network connection could not be created.

Later he realized that in Firefox he also was using auto-detect for the proxy settings. When he switched that to Use System Settings all was OK and he could have FF as default browser and get this form to work.

Conclusion
This is speculation on my part. I guess that our new version of Acrobat Reader X, v 10.1.1, is not competent in interpreting the auto-detect proxy setting, and that it is also tripped up by the proxy settings in Firefox.

There’s a lot more I’d like to understand about what was going on here, but sometimes speed counts. The next problem is already calling my name…

Categories
Admin IT Operational Excellence Network Technologies

The IT Detective Agency: ARP Entry OK, PING not Working

Intro
Yes, the It detective agency is back by popular demand. This time we’ve got ourselves a thriller involving a piece of equipment – a wireless LAN controller, WLAN – on a directly connected network. From the router we could see the arp entry for the WLAN, but we could not PING it. Why?

A trace, or more correctly the output of tcpdump run on the router interface connected to that network, showed this:

>
12:08:59.623509  I arp who-has rtr7687.drjohnhilgarts.com tell wlan.drjohnhilgarts.com
12:08:59.623530  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:01.272922  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:03.271765  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:05.271469  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:07.271885  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:09.271804  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:09.622902  I arp who-has rtr7687.drjohnhilgarts.com tell wlan.drjohnhilgarts.com
12:09:09.622922  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:11.271567  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:13.271716  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:15.271971  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:17.040748  I b8:c7:5d:19:b9:9e (oui Unknown) > Broadcast Null Unnumbered, xid, Flags [Command], length 46
12:09:17.271663  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:19.271832  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:19.392578  I b8:c7:5d:19:b9:9e (oui Unknown) > Broadcast Null Unnumbered, xid, Flags [Command], length 46
12:09:19.623515  I arp who-has rtr7687.drjohnhilgarts.com tell wlan.drjohnhilgarts.com
12:09:19.623535  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:20.478397  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:21.271714  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:23.271697  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:25.271664  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:27.272156  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:29.271730  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:29.621882  I arp who-has rtr7687.drjohnhilgarts.com tell wlan.drjohnhilgarts.com
12:09:29.621903  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:31.271765  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:33.271858  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43

What’s interesting is what isn’t present. No PINGs. No unicast traffic whatsoever, yet we knew the WLAN was generating traffic. The frequent arp requests for the same IP strongly hinted that the WLAN was not getting the response. We were not able to check the arp table of the WLAN. And we knew the WLAN was supposed to respond to our PINGs, but it wasn’t. Yet the router’s arp table had the correct entry for the WLAN, so we knew it was plugged into the right switch port and on the right vlan. We also triple-checked that the network masks matched on both devices. Let’s go back. Was it really on the right vlan??

The Solution
What we eventually realized is that in the WLAN GUI, VLANs were assigned to the various interfaces. the switch port, on a Cisco switch, was a regular access port. We reasoned (documentation was scarce) that the interface was vlan tagging its traffic. So we tried to change the access port to a trunk port and enter the correct vlan. Here’s the show conf snippet:

interface GigabitEthernet1/17
 description 5508-wlan
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 887
 switchport mode trunk
 spanning-tree portfast edge trunk

Bingo! With that in place we could ping the WLAN and it could send us its traffic.

Case closed.

2018 update
I had totally forgotten my own posting. And I’ll be damned if in the heat of connecting a new firewall to a switch port we didn’t have this weird situation where we could see MAC entries of the firewall, and it could see MACs of other devices on that vlan, but nobody could ping the firewall and vica versa. A trace from tcpdump looked roughly similar to the above – a lot of arp who-has firewall, tell server. Sure enough, the firewall guy, new to the group, had configured all his ports to be tagged ports, even those with a single vlan. It had been our custom to make single vlans non-tagged ports. I didn’t start it, that’s just how it was. More than an hour was lost debugging…

And earlier in the year was yet another similar incident, where a router operated by a vendor joining one of our vlans assumed tagged ports where we did not. More than an hour was lost debugging… See a pattern there?

I had forgotten my own post from seven years ago to such an extent, I was just about to write a new one when I thought, Maybe I’ve covered that before. So old topics are new once again… Here’s to remember this for the next time!

Where to watch out for this
When you don’t run all the equipment. If you ran it all you’d have the presence of mind to make all the ports consistent.

Some terminology
A tagged port can also be known as using 802.1q, which is also known as dot1q, which in Cisco world is known as a trunk port. In the absence of that, you would have an access port (Cisco terminology) or untagged port (everyone else).

Conclusion
OK, there are probably many reasons and scenarios in which devices on the same network can see each other’s arp entries, but not send unicast traffic. But, the scenario we have laid out above definitely produces that effect, so keep it in mind as a possibility should you ever encounter this issue.

Categories
Admin IT Operational Excellence Network Technologies

Internet Service Providers Block TCP Port 22 or Do They?

Intro
The original premise of this article is that some Internet Service Providers were seen to block TCP port 22, used by ssh and sftp. However, as often happens during active IT investigations, this turns out to be completely wrong. In fact there was a block in this case we studied, but not by the ISPs. An overly aggressive ACL on the customer premise equipment Internet router is in fact the culprit.

The Problem
(IPs skewed to protect whatever) We asked a partner to do an sftp to drjohnstechtalk.com. All firewall and routing rules were in place. The partner tried it. He saw a SYN packet leaving, but no packets being returned. Here at drjohnstechtalk, we didn’t see any packets whatsoever! This partner makes sftp connections to other servers successfully. What the heck?

We had them try the following basic command:

nc -v host 22

where host is the IP of the target server. The response was:

nc: connect to host port 22 (tcp) failed: No route to host

But switching to port 21 (FTP) showed completely different behaviour: there was no message whatsoever and the session hanged. That’s good! That’s the usual firewalls dropping packets. But this No route to host needs more exploration.

Getting Closer
So we did an open trace. I mean a tcpdump without any limiting expression. The dump showed the SYN out to port 22, followed by this nugget:

13:09:14.279176 IP Sprint_IP > src_IP: ICMP host target_IP unreachable - admin prohibited filter, length 36

Next Steps
This well-intentioned filtering is causing a business problem. The Cisco IOS ACL that got them into trouble was this one:

ip access-list extended drop-spoof-and-telnet
 deny   tcp any any eq 22 log-input

Solution
They liked the idea of this filtering, but apparently this was the first request for inbound ssh access. So they decided to keep this filter rule but precede it with more specific rules as required, essentially acting like a second firewall:

ip access-list extended drop-spoof-and-telnet
 permit tcp host IP_src host IP_dest eq 22
  deny   tcp any any eq 22 log-input
Categories
DNS IT Operational Excellence Network Technologies Uncategorized

Google’s DNS Servers Rock!

Intro
DNS is the Domain name Service, the Internet service that converts IP addresses, e.g., 200.54.129.57 into mnemonic names like www.mysite.com.

I tried to run a cache-only DNS server for use by a proxy server. What I found is that certain sites were not accessible on a frequent basis. I think uol.com.br is one of the problem sites (need to check this). It may not mean much to a US audience, but it’s really popular in Brazil!

At some point I happened to learn that Google has a public DNS service. This is worth pondering. No one of any repute has offered a DNS service to that point. There are a host of concerns about security, especially DNS cache poisoning. They blazed a trail, and did it in a way only Google and very few other major infrastructure players could. Not only did they offer a DNS service, they put their DNS servers all over the Internet and created convenient anycast addresses for their servers.

I am no expert on anycast addresses. You can look it up on Wikipedia, however. The essence for my purposes is that with a single IP address you’re going to hit the closest server, network-wise. So no matter where you are some Google DNS server is not far away. Try it. The anycast addresses are 8.8.8.8 and 8.8.4.4. They don’t mind, really! You can ping them. Traceroute to them, whatever. From the Amazon cloud Northeast 8.8.8.8 responds to PINGs in 3.4 ms. That’s really low. Not so low as to make me think they are in the same data center (it is different companies after all), but not far away.

The gold standard for running a DNS service is BIND. I have been running it for many years now and I want to give the Internet Software Consortium their due for providing this wonderful application. Once I got wind of my DNS difficulties as mentioned above, I had to wonder why not everyone else was complaining? They had to be using something else. I ran a flat-out performance test. 5000 queries from an actual proxy log, fed straight to my BIND DNS server, and then to Google’s DNS server 8.8.8.8. I have to dig up the numbers, but Google’s won by quite a bit! This result was actually surprising because you’re always going off-site to the Google DNS server, whereas my server can build up its cache and is right on my network. From where I tested the Google server was about 11 ms away. So 5000 x 11 ms = 55 s. So there is a 55 s handicap from just network considerations alone! Yet it is faster. On the quickest of queries the local server is indeed faster, but what happens is that over the course of real life queries, you always get a few problematic ones which either time out or just seem to take a long time to get back a response. That’s what kills the traditional DNS server and where Google has (obviously) made some optimizations.

And, that’s not all! Google also deals in a more forgiving fashion with broken domain names. I used to get on my high horse and proclaim to others about how broken their DNS servers are – it’s no wonder I can’t resolve their names, which means, by the way, I also cannot get to their web site nor send them email!

It’s effectively like taking yourself off the Internet, or so I thought. Turns out in some cases that’s only true if you’ve constrained yourself to resolving names with BIND. You see, BIND enforces the rules. And I’m a believer in rules. The Internet has about 5,000 technical rules called RFCs. DNS is a topic of many of these rules. The Internet could only have expanded to the size it currently has because all the major players agreed to abide by those rules. What Google has done with their server, in effect, is to say, “Well, if you don’t follow the rules, we’re going to try to work with you anyways.”

Here’s a concrete example. appliedcoatings.org. I guess at some point they’ll actually fix their severely broken DNS, but at the time I write this, August 21, 2011, these comments are valid and their domain is severely broken. In fact, I was amazed that people weren’t jumping up and down screaming at them. I couldn’t even send an email to them. That’s akin to knocking yourself off the Internet, right? Ah, but it all depends on whose DNS servers you are using!

There used to be lots of good free DNS analyzers, like dnsreport.com. You can still find a few around. www.zonecheck.fr, for instance. It shows FAILURE. If it were better written it would show the real problem, which is a lame delegation. But we’re experts, and we don’t need such tools! We will do the queries ourselves and show the lame delegation. We start by learning who are the authoritative nameservers for .ca, the top-level domain used in Canada:

 dig ns ca

; <<>> DiG 9.7.1-P2 <<>> ns ca
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52928
;; flags: qr rd ra; QUERY: 1, ANSWER: 10, AUTHORITY: 0, ADDITIONAL: 1

;; QUESTION SECTION:
;ca.                            IN      NS

;; ANSWER SECTION:
ca.                     83585   IN      NS      a.ca-servers.ca.
ca.                     83585   IN      NS      c.ca-servers.ca.
ca.                     83585   IN      NS      e.ca-servers.ca.
ca.                     83585   IN      NS      f.ca-servers.ca.
ca.                     83585   IN      NS      j.ca-servers.ca.
ca.                     83585   IN      NS      k.ca-servers.ca.
ca.                     83585   IN      NS      l.ca-servers.ca.
ca.                     83585   IN      NS      m.ca-servers.ca.
ca.                     83585   IN      NS      z.ca-servers.ca.
ca.                     83585   IN      NS      sns-pb.isc.org.

;; ADDITIONAL SECTION:
a.ca-servers.ca.        83594   IN      A       192.228.27.11

Now we ask one of them about the nameservers for appliedcoatings.ca:

 dig ns appliedcoatings.ca @a.ca-servers.ca.

; <<>> DiG 9.7.1-P2 <<>> ns appliedcoatings.ca @a.ca-servers.ca.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 288
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 2, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;appliedcoatings.ca.            IN      NS

;; AUTHORITY SECTION:
appliedcoatings.ca.     86400   IN      NS      sp2.domainpeople.com.
appliedcoatings.ca.     86400   IN      NS      sp1.domainpeople.com.

So far everything's cool. Now, since the authoritative flag (AA) was not present in that response we re-ask that query, but now to one of the nameservers that's supposed to be authoritative for that domain:

dig ns appliedcoatings.ca @sp2.domainpeople.com.

; <<>> DiG 9.7.1-P2 <<>> ns appliedcoatings.ca @sp2.domainpeople.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 24373
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;appliedcoatings.ca.            IN      NS

;; ANSWER SECTION:
appliedcoatings.ca.     86400   IN      NS      ns1.domainpeople.com.
appliedcoatings.ca.     86400   IN      NS      ns2.domainpeople.com.

Oh, oh. That's not supposed to happen. We're getting back an entirely different set of nameservers. That's a lame delegation. The domain should be considered completely broken. I think even BIND might be forgiving up to this point. a BIND resolver does these types of quesires to get at the answer. At this point it says, "OK, this is strange, but not necessariily fatal. I will ask my subsequent queries to ns1.domainpeople.com and ns2.domainpeople.com since they are listed as being the nameservers of record.

So now let's get to something useful: looking up the mail exchanger record so we see how to deliver mail to this domain. BIND, which has been fastidiously following the rules, does it as follows:

dig mx appliedcoatings.ca @ns1.domainpeople.com.

; <<>> DiG 9.7.1-P2 <<>> mx appliedcoatings.ca @ns1.domainpeople.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 49996
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;appliedcoatings.ca.            IN      MX

;; Query time: 79 msec
;; SERVER: 204.174.223.72#53(204.174.223.72)
;; WHEN: Sun Aug 21 19:05:43 2011
;; MSG SIZE  rcvd: 36

That's not good. Status is REFUSED. But BIND can even forgive this slight. There is one more nameserver to try after all, right? Last chance query:

dig mx appliedcoatings.ca @ns2.domainpeople.com.

; <<>> DiG 9.7.1-P2 <<>> mx appliedcoatings.ca @ns2.domainpeople.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 44404
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;appliedcoatings.ca.            IN      MX

;; Query time: 72 msec
;; SERVER: 64.40.96.140#53(64.40.96.140)
;; WHEN: Sun Aug 21 19:07:34 2011
;; MSG SIZE  rcvd: 36

Status also REFUSED. Now we are really and truly dead. If you are using a BIND nameserver you have no way to send email to someone@appliedcoatings.ca. But not so with Google!

Of course I don't know how Google wrote their DNS server, but I do think that some of their infrastructure experts write it themselves rather than using open source programs. So with a Google nameserver you will get a response:

dig mx appliedcoatings.ca @8.8.8.8

; <<>> DiG 9.7.1-P2 <<>> mx appliedcoatings.ca @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6901
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;appliedcoatings.ca.            IN      MX

;; ANSWER SECTION:
appliedcoatings.ca.     82805   IN      MX      10 mail.appliedcoatings.ca.

;; Query time: 4 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Sun Aug 21 19:11:14 2011
;; MSG SIZE  rcvd: 57

and just to close the loop and make sure this is a valid host you would do this:

dig mail.appliedcoatings.ca @8.8.8.8

; <<>> DiG 9.7.1-P2 <<>> mail.appliedcoatings.ca @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35190
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;mail.appliedcoatings.ca.       IN      A

;; ANSWER SECTION:
mail.appliedcoatings.ca. 86400  IN      A       66.183.21.181

And we can go the next step and begin an SMTP conversation with that server to make sure it is really operating. After all, if they messed up DNS there's no telling what else they might have gotten wrong.

 telnet  66.183.21.181 25
Trying 66.183.21.181...
Connected to 66.183.21.181.
Escape character is '^]'.
220 mail.appliedcoatings.ca Microsoft ESMTP MAIL Service, Version: 6.0.3790.4675 ready at  Sun, 21 Aug 2011 16:22:04 -0700
HELO localhost
250 mail.appliedcoatings.ca Hello [50.17.188.196]
quit
221 2.0.0 mail.appliedcoatings.ca Service closing transmission channel
Connection closed by foreign host.

Yup. They've got an operating mail server at that IP.

So we can reverse engineer a bit what Google's DNS server must have done behind the scenes to arrive at a valid answer where BIND could not. I'm 100% sure that Google would have also done the query

dig mx appliedcoatings.ca @ns1.domainpeople.com

since that is the right thing to do. But not getting a satisfactory answer (status: REFUSED), what it must do additionally after getting refused a second time by ns2.domainpeople, is to go back to the originally named nameservers sp1 and sp2. Watch what happens in that case:

 dig mx appliedcoatings.ca @sp1.domainpeople.com.

; <<>> DiG 9.7.1-P2 <<>> mx appliedcoatings.ca @sp1.domainpeople.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10226
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;appliedcoatings.ca.            IN      MX

;; ANSWER SECTION:
appliedcoatings.ca.     86400   IN      MX      10 mail.appliedcoatings.ca.

;; AUTHORITY SECTION:
appliedcoatings.ca.     86400   IN      NS      ns1.domainpeople.com.
appliedcoatings.ca.     86400   IN      NS      ns2.domainpeople.com.

;; ADDITIONAL SECTION:
mail.appliedcoatings.ca. 86400  IN      A       66.183.21.181

The AA (authoritative) flag is set in the response. So it's a good response, but sent to the "wrong" nameserver. Nevertheless, it is a response and it gets anyone using that nameserver more functionality than someone using BIND.

Conclusion
So far we've got three advantages speaking favorably for Google's DNS server: it's faster, it's answers are more complete and it's universally available. Wait, there's more! Another nice thing is what it does not do. Some ISPs have a "feature" I call DNS clobbering. In fact it's so annoying I will devote a whole blog post to describing it in more detail. Essentially they take license with DNS and make up answers to some queries! It's true and it's truly annoying. Not all ISPs do this but mine certainly does. So the other nice thing about Google DNS is that it does not do DNS clobbering and it's available for you to use it at home and avoid this annoying feature. You just set your DNS servers rather than have them assigned automatically via DHCP.

Other Resources
I should mention that while researching public DNS servers I was also led to commercial versions of the same thing. I went so far as to test the timings on one of those services and found that it is more distant, round-trip-wise, than Google's anycast server. Stands to reason. Google's got the best Internet access of anyone. They're on all the major highways. The commercial offerings have some additional cool features, however. They can serve as URL filter. So if someone puts in a URL which leads to a malicious site, for example, they can respond with an answer that spares you from going to that infected site. This is a little more crude than URL filtering at the proxy level, since a DNS server has no knowledge of the URI whereas a proxy URL filter does, but it could be quite serviceable. I'm not sure it allows you to pick and choose URL categories to block as with a URL filter (gambling, porn, hacking sites, etc.).

A lot more information on using Google DNS is at http://code.google.com/speed/public-dns/docs/using.html.

September 1 Update - a Crack in the Infrastructure
I now have my first case of a domain name which Google DNS did not resolve correctly, and for no apparent reason. The domain name is forums.tweaktown.com. Here's proof of Google's failure, followed immediately by Amazon's DNS servers' success:

dig forums.tweaktown.com @8.8.8.8

; <<>> DiG 9.7.1-P2 <<>> forums.tweaktown.com @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 15826
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;forums.tweaktown.com.          IN      A

;; AUTHORITY SECTION:
tweaktown.com.          116     IN      SOA     ns21.domaincontrol.com. dns.jomax.net. 2011060602 28800 7200 604800 86400

;; Query time: 4 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Thu Sep  1 14:40:50 2011
;; MSG SIZE  rcvd: 106


 dig forums.tweaktown.com

; <<>> DiG 9.7.1-P2 <<>> forums.tweaktown.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52290
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 1

;; QUESTION SECTION:
;forums.tweaktown.com.          IN      A

;; ANSWER SECTION:
forums.tweaktown.com.   1885    IN      A       38.101.21.25

;; AUTHORITY SECTION:
tweaktown.com.          1943    IN      NS      ns22.domaincontrol.com.
tweaktown.com.          1943    IN      NS      ns21.domaincontrol.com.

;; ADDITIONAL SECTION:
ns21.domaincontrol.com. 753     IN      A       216.69.185.11

;; Query time: 0 msec
;; SERVER: 172.16.0.23#53(172.16.0.23)
;; WHEN: Thu Sep  1 14:40:55 2011
;; MSG SIZE  rcvd: 122

All BIND servers I tried during this time returned the correct answer.

Is this an isolated incident or a tip of an iceberg of problems? I hope it is a one-off. I'll post updates as I find out more. I am slightly concerned now.

References and related
I finally wrote my own web interface to DNS and published the code I did it with. Check it out here.

A web interface to Google's public DNS service, which will give you more debug information, is https://dns.google.com/

Categories
IT Operational Excellence Network Technologies

Swapping Servers while Preserving IPs – What Might Go Wrong

The Setup

I had this experience last week. I needed to swap a virtual server in place of a physical server. I had all the access I needed on both servers to do the necessary network changes, which is how I customarily do these things. I implement network configuration changes as opposed to, say, plugging cables in and pulling others out.

The Issue

Anyways, this sounded straightforward enough.  The physical server had  a backup interface on a different segment – one that I could access from the backup interface of another server also on that backup segment (so that I wouldn’t disconnect myself as I was shutting down the primary interface).  So as I was saying, simple: shutdown the primary interface on the physical server, configure the VM’s intereface similar to the physical server, reboot the virtual server so the interface changes take effect and can be seen to be working even after a reboot.  But it didn’t work, or more precisely, it half-worked.  Why?

A Trail of Hints

Here’s what I didn’t yet say that turns out that has a significant role though I did not know it at the time.  See, that interface had two IPs defined, a primary and a virtual, I’ll call it secondary since virtual is a loaded term, IP, both on the same segment, i.e., eth0 and eth0:ns2v.  After the switch eth0 was working OK, but eth0:ns2v was not!  I also need to mention how they are used, from a network perspective, to see if you are following the hints and can guess what the problem might be before I spell it out.  I have different DNS servers bound to these interfaces.  They are resolving DNS servers.  It actually does not matter (another hint!) but the OS is SLES 11.

Final hint: eth0 probably took a few minutes to work, eth0:ns2v was not working even after 17 minutes.  By not working I mean that I could see the interface on the VM come up OK, my DNS server was bound to it and I could send it queries from the VM itself.  But queries from servers on other segments to this secondary were not being returned.  I tried a trace on the VM: tcpdump -i eth0:ns2v  (OK. I lied.  More hints on the way.  This is how you solve such problems!), while doing a PING from a server on another segment. Nothing coming in.  PINGs and DNS queries to primary interface were coming in fine however.  So I know I had my routing correct.

Biggest hint of all: I could PING this secondary interface from another server on the same segment!

So what the heck is going on here?  And it’s late at night of course so no one is disturbed by this change.  I begin to suspect the router.  After all, everything else is good, right?

Do I bother the network guy to fix his router?  For me that’s akin to admitting failture to plan.  So, no, I don’t want to.  That secondary interface isn’t that important.  it could wait until morning.  But it nags at me…

First Inkling

Then it hits me.  The Aha moment.  Let me back up.  Like I said I become convinced that the router is simply wrong.  But it’s one device I do not have any administrative access to.  What do I mean by “wrong” from a network engineer’s point-of-view?  I became convinced that its ARP table hadn’t aged out its entry for the secondary IP as it ought to have.  All hosts maintain an ARP table which stores the correspondence between IP address and MAC address of other devices on the same segment.  It’s how a device “knows” to talk to the right device when an application specifies an IP address – by correctly converting it into a MAC address.  So, you see, I preserved IPs.  But what if  the router held onto the old MAC address for the secondary IP?  It would try to send traffic destined for that IP which came in from other segments to the wrong place, or no place at all, since the old MAC was now offline.  I’m not exactly sure what happens to those packets.  I’d have to investigate and think about it some more.  Could be they get sent out via the switch but dropped by the switch as there’s no place to deliver them.

But the one IP, the main one, was working.  If you can’t solve what’s wrong, it’s a good idea to review what’s gone right amongst the things which are closely related.  And try to understand the difference in the two cases.

Aha Moment

That was the real Aha moment.  A server is always doing a bit of communication.  This and that chatter.  I realized the router was seeing some of that, and that it was all coming from the main IP.  Why? Because that’s just how things work in IPv4.  Usually.  So it made some sense that the router would update the ARP entry of the main IP.  After all it was seeing these packets come to it which contained the new MAC address/old IP address.  So it probably “knew” to update its ARP table with the new MAC from those packets.  But it wasn’t getting any packets that contained the new MAC address/old secondary IP address combination!  Knowing this situation, you would hope, as a reasonable person, that there would be an ARP table timer on all the ARP entries and they would simply age out and be renewed from time-to-time to prevent just such a situation.  You would hope, but it wasn’t happening.  17 minutes is a long time.  I later learned that this was an old router.  Supposedly it has an ARP timer of five or ten minutes.  But I know that isn’t correct. 

But I did not know any of that at the time.  I knew the main interface worked, the secondary didn’t.   Packets were streaming out of the primary to the router, no packets were streaming from the secondary to the router.  So I asked myself: how can I send packets from the secondary interface??  How do you do that?

I suspected two ways offhand.  I’m sure there are lots of others.  I suspected PING could do it.  Check the man pages.  Yup. ping -I interface_address.  Bingo.  I decided to ping the router, which is, of course, my default gateway, with the secondary IP as source.  The packets were returned.  Good.  Then I noticed that my monitors were completing.  I checked seconds later.  Sure enough, I could now reach that secondary IP from other segments.  Yeah!  Problem resolved.

Mystery solved, and no cold call to the networking group required.

Tying Up the Loose Ends

What would I have asked for if I had called the networking group?  I would have told them I suspect the ARP table on their router was not updating and could they please delete the ARP entry for that secondary IP, that’s what.  That’s what I would have done right away myself if I had had that kind of access.  On *ix devices there is usually a command like arp -d ip_address to delete a specific ARP entry. 

This also explains why another device on that segment could see that secondary IP while at the same time the router couldn’t.  The other device obviously had a more well-behaved ARP time-out mechanism.  Or perhaps it  didn’t but it had had no ARP entry for that secondary IP until after the server switch?  And of course the way modern switches work the traffic is all directed and carved up.  So the communication between those two devices, which would have contained nice and uptodate MAC/IP entries was completely segregated and none of it would have been seen by the router, so in that sense wasn’t helping any.  And what was the other way to send packets from a specific IP?  dig.  I use dig constantly in my capacity as a DNS admin, so I was aware it also allows you to specify your source IP address (dig -b).  Another way that most people would have thought of?  nmap.  I haven’t really checked, but I’m willing to bet nmap could easily also have been used.  But that’s kinf of a “nasty” utility and actually isn’t normally available on self-respecting servers.  It certainly wasn’t on this one.  sendmail MTA could also be used for this same purpose (setting the source IP), but that’s a pain in the rear to set up.  As I say there are probably lots of other utilities that do this.  nc or netcat, depending on your version of Linux, may also be promising.  The aspiring programmers could write a simple PERL (or pick your language) client to do the same thing, etc.   I now see that even telnet allows you to specify your source IP with the -b switch.  So it seems to be a fairly common feature – though not universal, just try to find it on an FTP client – in most networking utilities.

An IT person benefits from having lots of tools which accomplish the same things in different ways.

More Details As Time Permits