Intro
This case hardly qualifies as worthy subject matter for the IT Detective Agency – it’s pretty run-of-the-mill stuff. But I wanted to document it for completeness and to show how a problem in one place can turn out to have an unexpected cause (unexpected to me, at least; in hindsight it’s dead obvious what the issue was likely to have been).
The Situation
We have lots of servers at drjohns. So when one of our admins, Shake, said that one of them, nfuz01, can’t reach the Etrust server to get its virus updates, I had no recollection of what that server is or does. Shake asked if the firewall had changed recently. That’s sort of a tricky question because there are always minor changes being done. Most have absolutely no effect because they are additional rules providing new permissions. So I bravely answered No, it hadn’t. And I wondered what he meant in using the word “reach” anyway.
So I walk up to Shake’s desk to get a better idea. He said not only are virus signature updates not occurring, but neither server can PING the other, either by name or by IP address. Now we’re getting somewhere. I still haven’t registered where nfuz01 is, but I know the firewall as I’ve set it up permits ICMP traffic to transit. I suggested that maybe nfuz01 had some missing or messed-up routes. Then I went back to my desk to think some more. That’s what gets me motivated – when I’ve publicly speculated about the root cause of something. It’s not so much that I may be proven wrong, but if I am wrong, I want to be the first to find out and issue a correction.
So I tried a PING from my desktop:
C:\>ping 10.91.12.14
Pinging 10.91.12.14 with 32 bytes of data:
Reply from 171.18.252.10: TTL expired in transit.
Reply from 171.18.252.10: TTL expired in transit.
I look up where nfuz01 is. It is in a secondary data center. I ping it from a server in that same data center, but on a different segment – it works fine! I ping it from a Linux server in my main data center – totally different results:
> ping 10.91.12.14
PING 10.91.12.14 (10.91.12.14) 56(84) bytes of data.
From 171.18.252.10 icmp_seq=1 Time to live exceeded
From 171.18.252.10 icmp_seq=2 Time to live exceeded
--- 10.91.12.14 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1000ms
> traceroute -n 10.91.12.14
traceroute to 10.91.12.14 (10.91.12.14), 30 hops max, 40 byte packets
 1  10.136.188.2  1.100 ms  1.523 ms  2.010 ms
 2  10.1.4.2  0.934 ms  0.941 ms  0.773 ms
 3  10.1.4.141  0.869 ms  0.926 ms  0.913 ms
 4  171.18.252.10  1.076 ms  1.096 ms  1.130 ms
 5  10.1.4.129  1.043 ms  1.029 ms  1.018 ms
 6  10.1.4.141  0.993 ms  0.611 ms  0.918 ms
 7  171.18.252.10  0.932 ms  0.916 ms  1.002 ms
 8  10.1.4.129  0.987 ms  1.089 ms  1.121 ms
 9  10.1.4.141  1.152 ms  1.246 ms  1.229 ms
10  171.18.252.10  2.040 ms  2.747 ms  2.735 ms
11  10.1.4.129  1.332 ms  1.418 ms  1.467 ms
12  10.1.4.141  1.477 ms  1.754 ms  1.685 ms
13  171.18.252.10  1.993 ms  1.978 ms  2.013 ms
14  10.1.4.129  1.930 ms  1.960 ms  2.039 ms
15  10.1.4.141  2.065 ms  2.156 ms  2.140 ms
16  171.18.252.10  2.116 ms  5.454 ms  5.453 ms
17  10.1.4.129  4.466 ms  4.385 ms  4.296 ms
18  10.1.4.141  4.266 ms  4.267 ms  4.260 ms
19  171.18.252.10  4.232 ms  4.216 ms  4.216 ms
20  10.1.4.129  4.182 ms  4.063 ms  4.009 ms
21  10.1.4.141  3.994 ms  3.987 ms  2.398 ms
22  171.18.252.10  2.400 ms  2.484 ms  2.690 ms
23  10.1.4.129  2.346 ms  2.449 ms  2.544 ms
24  10.1.4.141  2.534 ms  2.607 ms  2.610 ms
25  171.18.252.10  2.602 ms  2.742 ms  2.736 ms
26  10.1.4.129  2.776 ms  2.856 ms  2.848 ms
27  10.1.4.141  2.648 ms  3.185 ms  3.291 ms
28  171.18.252.10  3.236 ms  3.223 ms  3.235 ms
29  10.1.4.129  3.219 ms  3.277 ms  3.377 ms
30  10.1.4.141  3.363 ms  3.381 ms  3.449 ms
Cool, right? We’ve caught a network loop in the act. Now I know it isn’t the firewall and it isn’t the routes on nfuz01, but it is something with networking. So I sent that off to the network team….
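What gives the loop away in that traceroute is the same three hops repeating until the 30-hop limit runs out. For anyone who would rather have a script spot that pattern, here is a minimal Python sketch. The function name find_loop is my own invention for illustration, not part of any tool, and the hop list is just the traceroute output above typed in by hand.

```python
def find_loop(hops):
    """Return the repeating cycle of hop addresses, or None if there is none.

    A loop is reported only if, from the first repeated hop onward,
    the rest of the path keeps cycling through the same addresses.
    """
    first_seen = {}
    for index, hop in enumerate(hops):
        if hop in first_seen:
            cycle = hops[first_seen[hop]:index]
            rest = hops[index:]
            if all(rest[i] == cycle[i % len(cycle)] for i in range(len(rest))):
                return cycle
        else:
            first_seen[hop] = index
    return None


# Hops 1-30 from the traceroute: two normal hops, then the same
# three routers over and over until the hop limit is reached.
prefix = ["10.136.188.2", "10.1.4.2"]
cycle = ["10.1.4.141", "171.18.252.10", "10.1.4.129"]
hops = prefix + cycle * 9 + cycle[:1]

loop = find_loop(hops)
if loop:
    print("Routing loop between:", " -> ".join(loop))
else:
    print("No loop found")
```

The same three addresses also explain the ping result: every trip around the loop costs the packet three units of TTL, so it expires somewhere in the loop and one of those routers (here 171.18.252.10) reports it.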
In less than an hour I got the explanation as well as the fix:
All should be reachable again. There’s a loop I can’t clear amongst some [telecom-owned] routers in the main data center. I’ve superseded it with two /27s until they clear it.
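My reading of “superseded it with two /27s” is that the fix relies on longest-prefix matching: routers prefer the most specific route that contains the destination, so two more-specific /27 routes take traffic away from the broader route that loops. The message doesn’t say what that broader prefix was, so the /26 in this Python sketch is purely an assumption for illustration, as are the “looping path” and “working path” labels and the best_route helper.

```python
import ipaddress

# ASSUMPTION: the looping route covered a /26. The real prefix is not
# given in the network team's message.
looping_prefix = ipaddress.ip_network("10.91.12.0/26")

# Splitting that prefix in half yields two more-specific /27s.
specifics = list(looping_prefix.subnets(new_prefix=27))

# Hypothetical routing table: (prefix, label for where the traffic goes).
routes = [(looping_prefix, "looping path")]
routes += [(net, "working path") for net in specifics]


def best_route(destination, table):
    """Longest-prefix match: the most specific containing prefix wins."""
    address = ipaddress.ip_address(destination)
    candidates = [entry for entry in table if address in entry[0]]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry[0].prefixlen)


prefix, path = best_route("10.91.12.14", routes)
print(prefix, "->", path)
```

The broader route is still in the table, but it never gets chosen for nfuz01’s address while the /27s are present, which is why the workaround holds until the loop itself is cleared.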
And it pings fine now:
> ping 10.91.12.14
PING 10.91.12.14 (10.91.12.14) 56(84) bytes of data.
64 bytes from 10.91.12.14: icmp_seq=1 ttl=125 time=46.2 ms
64 bytes from 10.91.12.14: icmp_seq=2 ttl=125 time=40.1 ms
64 bytes from 10.91.12.14: icmp_seq=3 ttl=125 time=23.1 ms
--- 10.91.12.14 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 23.114/36.502/46.275/9.796 ms
And Shake says the updates came in.
Case closed!
Conclusion
Why wasn’t the problem more obvious to us from the very beginning? Well, if the admin who said nfuz01 couldn’t reach Etrust had tried to log in to nfuz01 through the normal Remote Desktop mechanism (and, of course, failed), then we might have drilled down into a networking cause more quickly. But nfuz01 is a VM, and he must have been logged on via VMware Virtual Center, so he hadn’t noticed that the server basically couldn’t reach anything in our main data center. It is also an obscure server (remember that I had no recollection of it?), so no one really noticed that it was effectively out of business.