
The IT Detective Agency: virus updates are failing

This case hardly qualifies as worthy subject matter for the IT Detective Agency – it’s pretty run-of-the-mill stuff. But I wanted to document it for completeness and to show how a problem in one area can turn out to have an unexpected cause (unexpected to me, at least; in hindsight it’s dead obvious what the issue was likely to have been).

The Situation
We have lots of servers at drjohns. So when one of our admins, Shake, said that one of them, nfuz01, couldn’t reach the Etrust server to get its virus updates, I had no recollection of what that server is or does. Shake asked if the firewall had changed recently. That’s sort of a tricky question because there are always minor changes being made. Most have absolutely no effect because they are additional rules providing new permissions. So I bravely answered No, it hadn’t. And I wondered what he meant by the word “reach” anyways.

So I walked up to Shake’s desk to get a better idea. He said that not only were virus signature updates not occurring, but neither server could PING the other, by name or by IP address. Now we’re getting somewhere. I still hadn’t registered where nfuz01 is, but I know the firewall as I’ve set it up permits ICMP traffic to transit. I suggested that maybe nfuz01 had some missing or messed-up routes. Then I went back to my desk to think some more. That’s what gets me motivated – when I’ve publicly speculated about the root cause of something. It’s not so much that I may be proven wrong, but if I am wrong, I want to be the first to find out and issue a correction.

So I tried a PING from my desktop:

Pinging with 32 bytes of data:
Reply from TTL expired in transit.
Reply from TTL expired in transit.

I looked up where nfuz01 is. It is in a secondary data center. I pinged it from a server in that same data center, but on a different segment – it worked fine! I pinged it from a Linux server in my main data center – totally different results:

> ping
PING ( 56(84) bytes of data.
From icmp_seq=1 Time to live exceeded
From icmp_seq=2 Time to live exceeded
--- ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1000ms
> traceroute -n
traceroute to (, 30 hops max, 40 byte packets
 1  1.100 ms  1.523 ms  2.010 ms
 2  0.934 ms  0.941 ms  0.773 ms
 3  0.869 ms  0.926 ms  0.913 ms
 4  1.076 ms  1.096 ms  1.130 ms
 5  1.043 ms  1.029 ms  1.018 ms
 6  0.993 ms  0.611 ms  0.918 ms
 7  0.932 ms  0.916 ms  1.002 ms
 8  0.987 ms  1.089 ms  1.121 ms
 9  1.152 ms  1.246 ms  1.229 ms
10  2.040 ms  2.747 ms  2.735 ms
11  1.332 ms  1.418 ms  1.467 ms
12  1.477 ms  1.754 ms  1.685 ms
13  1.993 ms  1.978 ms  2.013 ms
14  1.930 ms  1.960 ms  2.039 ms
15  2.065 ms  2.156 ms  2.140 ms
16  2.116 ms  5.454 ms  5.453 ms
17  4.466 ms  4.385 ms  4.296 ms
18  4.266 ms  4.267 ms  4.260 ms
19  4.232 ms  4.216 ms  4.216 ms
20  4.182 ms  4.063 ms  4.009 ms
21  3.994 ms  3.987 ms  2.398 ms
22  2.400 ms  2.484 ms  2.690 ms
23  2.346 ms  2.449 ms  2.544 ms
24  2.534 ms  2.607 ms  2.610 ms
25  2.602 ms  2.742 ms  2.736 ms
26  2.776 ms  2.856 ms  2.848 ms
27  2.648 ms  3.185 ms  3.291 ms
28  3.236 ms  3.223 ms  3.235 ms
29  3.219 ms  3.277 ms  3.377 ms
30  3.363 ms  3.381 ms  3.449 ms

Cool, right? We’ve caught a network loop in the act. Now I know it isn’t the firewall and it isn’t the routes on nfuz01, but it is something with networking. So I sent that off to the networking folks….
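For what it’s worth, the tell-tale sign of a routing loop in a traceroute is the same hop address showing up more than once while the trace runs all the way out to the hop limit. Here is a minimal Python sketch of that check. The hop addresses are made up for illustration (the real ones are not shown above), and find_loop is just a hypothetical helper name, not part of any tool.

```python
def find_loop(hops):
    """Return (first_hop, repeat_hop, address) for the first address that
    appears twice in a traceroute hop list, or None if no address repeats.

    hops is a list of hop addresses in order; None stands for a hop that
    did not answer (shown as '*' in traceroute output).
    """
    seen = {}  # address -> hop number where it was first seen
    for hop_number, address in enumerate(hops, start=1):
        if address is None:
            continue  # unanswered hop, nothing to compare
        if address in seen:
            return seen[address], hop_number, address
        seen[address] = hop_number
    return None


# Hypothetical hop list: the packet bounces between two routers.
sample_hops = [
    "10.0.0.1", "10.0.1.1",
    "10.0.2.1", "10.0.3.1",
    "10.0.2.1", "10.0.3.1",
]

print(find_loop(sample_hops))  # (3, 5, '10.0.2.1')
```

A repeat like this, combined with the "Time to live exceeded" replies from ping, points at routing rather than at the firewall or the host itself.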

In less than an hour I got the explanation as well as the fix:

All should be reachable again. There’s a loop I can’t clear amongst some [telecom-owned] routers in the main data center. I’ve superseded it with two /27s until they clear it.
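The workaround relies on routers preferring the most specific matching route (longest prefix match), so two /27s win over a shorter prefix that covers the same addresses. A rough Python sketch of the idea follows. It assumes the looping route was a /26 (the note above doesn’t say) and uses a made-up documentation range; best_route and the next-hop labels are hypothetical.

```python
import ipaddress


def best_route(dest, routes):
    """Return the (network, next_hop) pair with the longest prefix that
    contains dest, or None if no route matches."""
    addr = ipaddress.ip_address(dest)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)


# Assumption: the broken route is a /26. Split it into its two /27 halves.
looping = ipaddress.ip_network("192.0.2.0/26")
halves = list(looping.subnets(prefixlen_diff=1))

# The old route stays in the table; the two /27s are added alongside it.
routes = [(looping, "looping-path")]
routes += [(half, "direct-path") for half in halves]

network, next_hop = best_route("192.0.2.40", routes)
print(network, next_hop)  # 192.0.2.32/27 direct-path
```

The looping route never has to be removed: as long as the more specific routes exist, traffic for those addresses no longer follows it.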

And it pings fine now:

> ping
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=125 time=46.2 ms
64 bytes from icmp_seq=2 ttl=125 time=40.1 ms
64 bytes from icmp_seq=3 ttl=125 time=23.1 ms
--- ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 23.114/36.502/46.275/9.796 ms

And Shake says the updates came in.

Case closed!

Why wasn’t the problem more obvious to us from the very beginning? Well, if the admin who said nfuz01 couldn’t reach Etrust had tried to log in to nfuz01 through the normal Remote Desktop mechanism – and of course failed – then we might have drilled down into a networking cause more quickly. But nfuz01 is a VM and he must have been logged on via VMware Virtual Center, so he hadn’t noticed that the server basically couldn’t reach anything in our main data center. It is also an obscure server (remember that I had no recall of it?), so no one really noticed that it was effectively out of business.
