Intro
I sometimes consult for the networking group of a large company. This incident really happened. I don’t know that it could ever happen again to anyone else, but it’s so bizarre that I just had to document it as an example of “you wouldn’t believe it unless you had actually been through it yourself.”
Let’s get into it
This company has lots of small and mid-sized offices connected via MPLS. WAN services are provided by a single telecom throughout the country. I feel obliged to not divulge specifics here. Let’s call the telecom “OE” as in over-extended.
So just before lunch yesterday they tell me that no PCs at one of their sites can access Internet, and this information is coming from a very reliable source. It also comes out that a second site is similarly affected. It kind of sounds like a WAN problem, but no other sites are affected. In the old days you’d almost certainly know to look at the WAN, but these days it’s a little more complicated. Everyone’s PC is in AD and they have the ability to push a GPO to all PCs at a site, so you just never know if the desktop group wasn’t involved in messing them up.
So they tell me they can PING their corporate Intranet server. Fine. But they cannot telnet to port 80. Newsflash. How did they get telnet enabled in Windows 7? I mentally stored this question for my continuing education. Crises are also great learning/teaching moments if you are of that frame-of-mind!
Ping is good. Of course I test the Intranet server myself, iwww.intranet. I can reach port 80 just fine. It happens that the front-end for iwww.intranet is a load balancer. I decide to do a trace using tcpdump. I’m not sure what I’ll find, but taking a trace is sort of a gut reaction in these cases. There’s lots of other traffic so we have to use a filter to see the tree in the forest. They give me the IP of the PC they’re testing from. My expression is something like this:
> tcpdump -i 0.0 host WKSTATION.AD.INTRANE
The 0.0 on this particular device is its way of saying use any and all interfaces.
Here’s what the output looks like:
11:51:03.852511 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de0a5), length 100
11:51:05.855446 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de101), length 100
11:51:06.187940 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de10b), length 100
11:51:08.857957 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de16d), length 100
11:51:09.184072 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de179), length 100
11:51:09.858865 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de198), length 100
11:51:14.855327 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de272), length 100
11:51:15.183349 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de27f), length 100
11:52:19.898380 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3decab), length 100
11:52:22.901684 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3ded28), length 100 |
11:51:03.852511 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de0a5), length 100
11:51:05.855446 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de101), length 100
11:51:06.187940 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de10b), length 100
11:51:08.857957 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de16d), length 100
11:51:09.184072 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de179), length 100
11:51:09.858865 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de198), length 100
11:51:14.855327 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de272), length 100
11:51:15.183349 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de27f), length 100
11:52:19.898380 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3decab), length 100
11:52:22.901684 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3ded28), length 100
I had never seen that before. I loop up ESP and see it is related to IP protocol 50 which in turn is used in IPSEC for VPN connections.
What the…?
Yeah. Let’s review. He’s sending TCP packets to port 80 of iwww.intranet and all I’m seeing are these ESP packets, which isn’t even TCP. Of course the load balancer has no idea whatsoever what to do with those packets and simply does not respond to them.
What would you do? I felt the nail in the coffin would be to take a trace on the PC itself – see how the packets are when they’re coming straight out of the PC. But to be honest they never did make that trace. They didn’t have Wireshark installed so it would take awhile.
Meanwhile, the infrastructure folks are talking to each other and someone mentions the OE has a certain project “runVPN” that thay’re rolling out. Now that sounds suspicous. In the imperfect world you have to work with what you have, not what you’d like to have. Based on our experiences and educated hunches, we now feel pretty certain it’s gotta be a WAN problem caused by OE.
Within an hour OE confirms the problem is of their creation, and they have it fixed. They are very unhappy with the tech who caused it.
Conclusion
Sometimes things are what they appear to be. If you notice I didn’t do much to really help with the issue, and that’s all just about anyone can do when so much is outsourced. They did feel that my trace helped to convince the telecom that this really was their issue.
I guess they were encrypting WAN traffic on one end, but forgot to decrypt it on the other end. One of the strangest things to have on a production network.
Turning on telnet in Windows 7
Did I mention that they tested with telnet on Windows 7? They later explained how to enable it. Go to control panel / Programs / Programs and Features / Turn Windows features on. There is an option for Telnet client. Reboot (yes, you really need to reboot for this to take effect) and you’re good to go.