The IT Detective Agency: virus updates are failing

Intro
This case hardly qualifies as worthy subject matter for the IT Detective Agency – it’s pretty run-of-the-mill stuff. But I wanted to document it for completeness and show how a problem in one thing can turn out to have an unexpected cause (At least to me. In hindsight it’s dead obvious what the issue was likely to have been).

The Situation
We have lots of servers at drjohns. So when one of our admins, Shake, said that one of them, nfuz01, can’t reach the Etrust server to get its virus updates, I had no recollection of what that server is or does. Shake asked if the firewall had changed recently. That’s sort of a tricky question because there are always minor changes being made. Most have absolutely no effect on existing traffic because they are additional rules providing new permissions. So I bravely answered No, it hadn’t. And I wondered what he meant by the word “reach” anyway.

So I walk up to Shake’s desk to get a better idea. He said that not only are virus signature updates not occurring, but neither server can PING the other, whether by name or by IP address. Now we’re getting somewhere. I still haven’t registered where nfuz01 is, but I know the firewall as I’ve set it up permits ICMP traffic to transit. I suggested that maybe nfuz01 had some missing or messed-up routes. Then I went back to my desk to think some more. That’s what gets me motivated – when I’ve publicly speculated about the root cause of something. It’s not so much that I may be proven wrong, but if I am wrong, I want to be the first to find out and issue a correction.

So I tried a PING from my desktop:

C:\>ping 10.91.12.14
 
Pinging 10.91.12.14 with 32 bytes of data:
Reply from 171.18.252.10: TTL expired in transit.
Reply from 171.18.252.10: TTL expired in transit.

I look up where nfuz01 is. It is in a secondary data center. I ping it from a server in that same data center, but on a different segment – it works fine! I ping it from a Linux server in my main data center – totally different results:

> ping 10.91.12.14
PING 10.91.12.14 (10.91.12.14) 56(84) bytes of data.
From 171.18.252.10 icmp_seq=1 Time to live exceeded
From 171.18.252.10 icmp_seq=2 Time to live exceeded
 
--- 10.91.12.14 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1000ms
 
> traceroute -n 10.91.12.14
traceroute to 10.91.12.14 (10.91.12.14), 30 hops max, 40 byte packets
 1  10.136.188.2  1.100 ms  1.523 ms  2.010 ms
 2  10.1.4.2  0.934 ms  0.941 ms  0.773 ms
 3  10.1.4.141  0.869 ms  0.926 ms  0.913 ms
 4  171.18.252.10  1.076 ms  1.096 ms  1.130 ms
 5  10.1.4.129  1.043 ms  1.029 ms  1.018 ms
 6  10.1.4.141  0.993 ms  0.611 ms  0.918 ms
 7  171.18.252.10  0.932 ms  0.916 ms  1.002 ms
 8  10.1.4.129  0.987 ms  1.089 ms  1.121 ms
 9  10.1.4.141  1.152 ms  1.246 ms  1.229 ms
10  171.18.252.10  2.040 ms  2.747 ms  2.735 ms
11  10.1.4.129  1.332 ms  1.418 ms  1.467 ms
12  10.1.4.141  1.477 ms  1.754 ms  1.685 ms
13  171.18.252.10  1.993 ms  1.978 ms  2.013 ms
14  10.1.4.129  1.930 ms  1.960 ms  2.039 ms
15  10.1.4.141  2.065 ms  2.156 ms  2.140 ms
16  171.18.252.10  2.116 ms  5.454 ms  5.453 ms
17  10.1.4.129  4.466 ms  4.385 ms  4.296 ms
18  10.1.4.141  4.266 ms  4.267 ms  4.260 ms
19  171.18.252.10  4.232 ms  4.216 ms  4.216 ms
20  10.1.4.129  4.182 ms  4.063 ms  4.009 ms
21  10.1.4.141  3.994 ms  3.987 ms  2.398 ms
22  171.18.252.10  2.400 ms  2.484 ms  2.690 ms
23  10.1.4.129  2.346 ms  2.449 ms  2.544 ms
24  10.1.4.141  2.534 ms  2.607 ms  2.610 ms
25  171.18.252.10  2.602 ms  2.742 ms  2.736 ms
26  10.1.4.129  2.776 ms  2.856 ms  2.848 ms
27  10.1.4.141  2.648 ms  3.185 ms  3.291 ms
28  171.18.252.10  3.236 ms  3.223 ms  3.235 ms
29  10.1.4.129  3.219 ms  3.277 ms  3.377 ms
30  10.1.4.141  3.363 ms  3.381 ms  3.449 ms

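Cool, right? We’ve caught a network loop in the act: the same three routers (10.1.4.141, 171.18.252.10 and 10.1.4.129) hand the packet around in a circle until its TTL runs out. Now I know it isn’t the firewall and it isn’t the routes on nfuz01, but it is something with networking. So I sent that off to the networking folks….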
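If you would rather not eyeball 30 lines of traceroute output, the loop can also be spotted with a few lines of Python. This is only a rough sketch: it assumes `traceroute -n` style output like the one above, and the names `parse_hops` and `find_loop` are my own inventions for illustration, not part of any tool. A repeated hop address is treated as evidence of a loop, which is a reasonable heuristic but not proof.

```python
import re

# Matches lines such as " 4  171.18.252.10  1.076 ms ..." and captures the hop IP.
HOP_RE = re.compile(r"^\s*\d+\s+(\d{1,3}(?:\.\d{1,3}){3})\s")

def parse_hops(traceroute_output):
    """Return the hop IP addresses, in order, from `traceroute -n` output."""
    hops = []
    for line in traceroute_output.splitlines():
        match = HOP_RE.match(line)
        if match:  # the header line and blank lines simply do not match
            hops.append(match.group(1))
    return hops

def find_loop(hops):
    """Return the list of addresses forming the first repeating cycle, or None."""
    first_seen = {}
    for index, ip in enumerate(hops):
        if ip in first_seen:
            # Everything from the first sighting up to this repeat is the cycle.
            return hops[first_seen[ip]:index]
        first_seen[ip] = index
    return None

# The first seven hops from the traceroute shown earlier.
SAMPLE = """\
traceroute to 10.91.12.14 (10.91.12.14), 30 hops max, 40 byte packets
 1  10.136.188.2  1.100 ms  1.523 ms  2.010 ms
 2  10.1.4.2  0.934 ms  0.941 ms  0.773 ms
 3  10.1.4.141  0.869 ms  0.926 ms  0.913 ms
 4  171.18.252.10  1.076 ms  1.096 ms  1.130 ms
 5  10.1.4.129  1.043 ms  1.029 ms  1.018 ms
 6  10.1.4.141  0.993 ms  0.611 ms  0.918 ms
 7  171.18.252.10  0.932 ms  0.916 ms  1.002 ms
"""

if __name__ == "__main__":
    print(find_loop(parse_hops(SAMPLE)))
    # prints ['10.1.4.141', '171.18.252.10', '10.1.4.129']
```

The cycle it reports is exactly the trio of routers that keeps repeating in the trace.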

In less than an hour I got the explanation as well as the fix:

All should be reachable again. There’s a loop I can’t clear amongst some [telecom-owned] routers in the main data center. I’ve superseded it with two /27s until they clear it.
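My reading of “superseded it with two /27s” is that more-specific routes were injected, so that longest-prefix matching steers traffic away from the looping route. Here is a small Python sketch of that idea. Note the assumptions: the /26 aggregate 10.91.12.0/26 and the labels “looping path” and “working path” are hypothetical, chosen only so that nfuz01’s address falls inside them. I don’t know the actual prefixes or next hops involved.

```python
import ipaddress

def longest_prefix_match(dest, routes):
    """Return the (network, label) route with the longest prefix containing dest."""
    addr = ipaddress.ip_address(dest)
    candidates = [(net, label) for net, label in routes if addr in net]
    if not candidates:
        return None
    return max(candidates, key=lambda route: route[0].prefixlen)

# Hypothetical aggregate route that currently leads into the loop.
aggregate = ipaddress.ip_network("10.91.12.0/26")
routes = [(aggregate, "looping path")]
before = longest_prefix_match("10.91.12.14", routes)

# Split the aggregate into its two /27 halves and add them as more-specific routes.
routes += [(half, "working path") for half in aggregate.subnets(prefixlen_diff=1)]
after = longest_prefix_match("10.91.12.14", routes)

if __name__ == "__main__":
    print(before)  # the /26 is the only match
    print(after)   # the /27 containing 10.91.12.14 now wins
```

The original route never has to be removed. The two /27s together cover the same addresses, and a router prefers the more specific match.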

And it pings fine now:

> ping 10.91.12.14
PING 10.91.12.14 (10.91.12.14) 56(84) bytes of data.
64 bytes from 10.91.12.14: icmp_seq=1 ttl=125 time=46.2 ms
64 bytes from 10.91.12.14: icmp_seq=2 ttl=125 time=40.1 ms
64 bytes from 10.91.12.14: icmp_seq=3 ttl=125 time=23.1 ms
 
--- 10.91.12.14 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 23.114/36.502/46.275/9.796 ms

And Shake says the updates came in.

Case closed!

Conclusion
Why wasn’t the problem more obvious to us from the very beginning? Well, if the admin who said nfuz01 couldn’t reach Etrust had tried to log in to nfuz01 through the normal Remote Desktop mechanism – and of course failed – then we might have drilled down into a networking cause more quickly. But nfuz01 is a VM and he must have been logged on via VMware Virtual Center, so he hadn’t noticed that the server basically couldn’t reach anything in our main data center. It is also an obscure server (remember that I had no recall about it?) so no one really noticed that it was effectively out of business.
