The IT Detective Agency: bad PING times explained

Intro
In a complicated corporate environment, somewhat unusual problems can become extremely difficult to debug because many technicians are involved, each one knowing just their piece of the infrastructure. Such was the case when I consulted on a problem in which a company reported slow Internet response for users in Asia.

The details
In preparation for an evening call I managed to get enough access to log into the proxy server their Asian users use. Even issuing regular commands was slow, often hanging for many seconds.

The call
So we had the call at 9 PM. These things are always pretty amazing in the sense of: how do corporations ever get anything done when they've outsourced so much? There was a representative from the firewall team in Germany, a representative from the telecom in the US, a couple of representatives from the same telecom stationed in Asia, an employee who oversees the telecom vendor in the US, and me, representing the proxy service normally handled by a group in Europe. The common language was English, of course, though that doesn't mean everyone was easy for us native speakers to understand. No one on the call had any real familiarity with the infrastructure; we were all essentially reverse engineering it from a diagram the telecom produced.

The firewall guy and I both noticed that PINGs from the proxy to its gateway (which was a firewall) took 50 msec. I had never seen that before. The same was true for another piece of equipment using that same firewall interface. The telecom, which was responsible for the switches, could not actually log in to all of them; only the firewall guy could reach a few. And he eventually figured out that the diagram we were all using was somewhat wrong: different switches were in use for some of the equipment than indicated. That mattered because we wanted to check the interface status on the switch.
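
For context, a host PINGing its own gateway on the same LAN segment should normally see replies in well under a millisecond, so 50 msec is a red flag all by itself. Below is a minimal sketch of the kind of latency check we were doing by hand, assuming a Linux host with the standard iputils ping and a purely hypothetical gateway address:

#!/usr/bin/env python3
# Rough latency check: ping the gateway a few times and summarize the
# round-trip times. Assumes a Linux host with iputils ping; the gateway
# address below is hypothetical.
import re
import subprocess

GATEWAY = "10.0.0.1"   # hypothetical address of the proxy's gateway (the firewall)
COUNT = 10

result = subprocess.run(["ping", "-c", str(COUNT), GATEWAY],
                        capture_output=True, text=True)

# iputils prints one line per reply, e.g. "64 bytes from 10.0.0.1: ... time=0.42 ms"
rtts = [float(t) for t in re.findall(r"time=([\d.]+) ms", result.stdout)]

if rtts:
    avg = sum(rtts) / len(rtts)
    print(f"{len(rtts)}/{COUNT} replies, min/avg/max = "
          f"{min(rtts):.2f}/{avg:.2f}/{max(rtts):.2f} ms")
    if avg > 5:
        print("suspicious: a gateway on the same LAN should answer in well under 1 ms")
else:
    print("no replies at all - check the interface and cabling")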

So imagine this. The guy with no access to anything, the vendor overseer in the US, patiently asks for the results of a few commands to be shared (by email) with the group. He gets the results of show interface status and identifies one port as looking off: it's listed as 100 mbit instead of 1 gig. In addition to the strange PING times, when we PINGed from equipment in the US, the packet loss rate varied between 4 and 15%. Pretty high, in other words. They try to reset the port and hard-code the speed and duplex on both sides, but nothing works. And this is the port the firewall is connected to.
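
Loss that bounces around in that range usually points at a layer-1 or speed/duplex problem rather than simple congestion. Here is a minimal sketch of the sort of loss measurement that was being relayed over email, again assuming a Linux host with iputils ping and a hypothetical far-end address:

#!/usr/bin/env python3
# Estimate packet loss to a remote host by sending a burst of pings and
# parsing the summary line. Assumes a Linux host with iputils ping; the
# target address below is hypothetical.
import re
import subprocess

TARGET = "10.1.1.10"   # hypothetical address of the far-end equipment
COUNT = 50             # more probes give a steadier loss estimate

result = subprocess.run(["ping", "-c", str(COUNT), TARGET],
                        capture_output=True, text=True)

# The summary line looks like: "50 packets transmitted, 46 received, 8% packet loss, ..."
m = re.search(r"(\d+) packets transmitted, (\d+) received", result.stdout)
if m:
    sent, recv = int(m.group(1)), int(m.group(2))
    loss = 100.0 * (sent - recv) / sent
    print(f"{recv}/{sent} replies, {loss:.1f}% packet loss")
    if loss > 1:
        print("sustained loss at this level suggests a bad link, cable or speed/duplex mismatch")
else:
    print("could not parse ping output")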

We finally switch to the backup firewall. This breaks the routing, which then has to be fixed, but finally it is, and suddenly response is much better: PINGing the gateway from the proxy is at the expected < 1 msec. Not content to leave it at that, they persist in fixing the broken port. They reason that, given the test results, the most likely culprit is a bad cable. He explains that in 1 gig communication all 8 wires have to be good; if just one breaks you can't run 1 gig. (Gigabit Ethernet over copper uses all four pairs, whereas 100 mbit needs only two, which is why the link could still come up at the lower speed.) Now they have to figure out who has access to the third-party data center where the equipment is hosted! They finally identify an employee with access and get him to come with cables of different lengths (of course no one knows the layout well enough to say how long the cable needs to be or how close the equipment is). The cable is replaced and both sides come up at 1 gig, auto-negotiated! They revert to the primary firewall. So in the end the employee without access to anything figured this out. Amazing. The intense activity on this problem lasted from 9 PM to about 4 AM the following morning.

The history
Actually, it would have been a pretty decent turnaround by this company's standards if the problem had been resolved in seven hours. But in fact it had been ongoing for a couple of weeks beforehand. It seems the total throughput was capped at 100 mbit by that bad cable, so it wasn't a total outage and it wasn't at all obvious where to look.

Case closed!

Conclusion
I think a lot of people on the call had the expertise to solve the problem, and much more quickly than it was solved, but no one had sufficient access to do the debugging his own way; everyone needed the cooperation of others. The telecom that owns and manages the switches was particularly disappointing in its performance. Not the individuals, who seemed competent, but the processes, which permitted faulty and incomplete documentation as well as a lack of familiarity with the particular infrastructure: as though you'd hired a smart network technician and told him nothing about what he's supporting.
Keep good people on staff! Give them as much access as possible.
