One of the high arts of IT is system integration, and an important off-shoot of this is acquisitions. We are involved in integrating a new location, which, unfortunately, we do not yet have full access to. The local networking is still provided by their vendor, not ours and this makes troubleshooting all the more difficult.
So the word begins to spread that users at this site are having intermittent problems accessing some of our secure web sites. As it was described to me, they can load the page in their browser for, say five straight times, get a simple Internet Explorer cannot display the web page error, and the sixth time (or whenever) it will load properly. All other connectivity was working. No one else at other locations was having this problem with this web site. More than strange, right?
In drjohn’s perfect IT world, problem reproducibility is critical to resolution, but we simply didn’t have it this time. I also could not produce the problem myself, which means relying on other people.
I’m not sure if we tried to contact their vendor or not at first. But if we had I’m sure they would have denied having anything to do with it.
So we got one of our confederates, Tim, over to this location and we hooked him up with Wireshark so he could get take a packet trace when the failure occurs. It wasn’t long before Tim reproduced the error and emailed us the packet capture.
In the following the PC has IP address 10.200.23.34, the web server is at 10.4.5.6. The Linux command used to look at the capture file is:
# tcpdump -A -r bodega-error.cap port 443 > /tmp/dump
1 15:54:27.495952 IP 10.200.23.34 > 10.4.5.6.https: S 2803722614:2803722614(0) win 64240 <mss 1460,nop,wscale 0,nop,nop,sackOK>
2 15:54:27.496309 IP 10.4.5.6.https > 10.200.23.34: S 3201081612:3201081612(0) ack 2803722615 win 5840 <mss 1432,nop,nop,sackOK>
3 15:54:27.496343 IP 10.200.23.34 > 10.4.5.6.https: . ack 1 win 64240
4 15:54:27.497270 IP 10.200.23.34 > 10.4.5.6.https: P 1:82(81) ack 1 win 64240
5 15:54:27.497552 IP 10.4.5.6.https > 10.200.23.34: . ack 82 win 5840
6 15:54:30.743827 IP 10.4.5.6.https > 10.200.23.34: P 1:286(285) ack 82 win 5840
..S.......^M..i.P.......HTTP/1.0 200 OK^M
X-Bypass-Cache: Application and Content Networking System Software 5.5.17^M
<HTML><HEAD><META HTTP-EQUIV="REFRESH" CONTENT="0;URL=https://10.4.5.6/"></HEAD><BODY>
7 15:54:30.744036 IP 10.200.23.34 > 10.4.5.6.https: F 82:82(0) ack 286 win 63955
8 15:54:30.744052 IP 10.4.5.6.https > 10.200.23.34: F 286:286(0) ack 82 win 5840
9 15:54:30.744077 IP 10.200.23.34 > 10.4.5.6.https: . ack 287 win 63955
10 15:54:30.744289 IP 10.4.5.6.https > 10.200.23.34: . ack 83 win 5840
The output was scrubbed a bit of meaningless junk characters and I added serial packets numbers in the beginning by hand because I don’t (yet) know how to do that with tcpdump!
What, It’s Encrypted – what can you even learn from a trace?
Yeah, an SSL stream sure adds to the already steep challenges we faced in this problem. There just isn’t much to work with. But it is something. I’m about to say what I noticed in this packet trace, but for it to be meaningful you need to know like I did that the web server is situated almost four thousand miles from the user’s location.
The first packet is a SYN from the PC to web server on TCP port 443. So far so good. In fact packets one – three constitute the three-way handshake in TCP.
Although SSL is encrypted, the beginning of the protocol communication should show the SSL cipher being chosen. Unfortunately, tcpdump doesn’t seem to have the smarts to show any of this. So I got myself ssldump. On Ubuntu:
# sudo apt-get install ssldump
did the trick. Then run this same capture file through ssldump, which has very similar arguments to tcpdump:
# ssldump -r bodega-error.cap port 443
New TCP connection #1: 10.200.23.34(2027) <-> 10.4.5.6(443)
1 1 0.0013 (0.0013) C>S SSLv2 compatible client hello
Unknown value 0xff
Unknown SSL content type 72
1 3.2480 (3.2467) C>S TCP FIN
1 2 3.2481 (0.0000) S>CShort record
1 3.2481 (0.0000) S>C TCP FIN
The way to interpret this is that 0.0013 s into the TCP port 443 communication the cipher suites listed above were sent by the PC to the server. This corresponds to our packet number 4 in the trace file.
Using Wireshark to look at the trace is a lot more convenient – it provides packet numbers, timing, decodes packets and displays the SSL ciphers. But I wanted to show that it _could_ be done with text-based tools.
Look at the timings more closely. In the tcpdump output, packet 2, the SYN ACK, comes 1 ms after the SYN. But given the distances involved between PC and server, the SYN ACK should have come more like 100 ms later, at least. Similarly packet 5, which is an ACK, comes less than 1 ms after packet 4. A 1 ms ACK? Physically impossible.
I have seen this behaviour before – on our own load balance – which I know employs some TCP optimization tricks. So I concluded that they must have physically present at this site some kind of appliance which is doing TCP optimization. It can only provide blank ACKs in its rapid-fire responses since it can’t know what data the server is really going to respond with. That might all be OK. But I’m pretty sure the problem lies between packets 5 and 6. 5 is one of those meaningless rapid-fire empty ACKs generated by the local router. But the PC has just sent a wish list of SSL ciphers in packet 4. It needs to be responded to by the server which has to finish setting up the SSL session.
But that critical packet from the server never arrives. Perhaps even some of the SSL handshake is secretly completed between the local router and the server. Who knows? I have heard of man-in-the-middle devices that decrypt SSL sessions. And packet 6 contains fairly inappropriate content. It almost does look like it has been manufactured by a man-in-the-middle device. Its telling the browser to do a redirect to the same site, except specified by IP address rather than FQDN. And that doesn’t make a lot of sense. The browser likely realizes that this amounts to a looping redirect request so at that point it probably decides to cut its losses and FIN the connection in packet 7.
I traced my own PC hitting this same web server. Now I know we don’t have any of these optimizing devices between me and the web server. I don’t have time to show the results here, but to summarize, it looks rather completely different from the trace above. The ACK packets come back in about 100 ms or so. There is no delay of three seconds. The cipher proposals are responded to in a timely fashion. There is no redirect.
Their Side of the Story
We did get to hear back from the vendor who supports the LAN/WAN. They said they were running WCCP and diverting traffic to a proxy server. This was the correct behaviour before we hooked our infrastructure to this site, but is no longer. They realized this was probably a bad thing and took corrective action to turn off WCCP for destinations in the internal network 10.0.0.0/8.
Shutting off WCCP, which diverted web site requests to an old proxy server, fixed the problem.
I wish we could tie all the loose ends neatly up, but there are too many players involved. We’ll never really know why the problem was intermittent, for instance. Or why some secure web sites could be accessed without any issue whatsoever throughout this ordeal.
WCCP, Web cache Communication Protocol, is a Cisco-developed routing protocol to transparently intercept traffic destined for web servers. More information can be found on it in wikipedia.
It bothers me that after the SSL session was initiated the dump showed the source, unencrypted, of the HTML redirect packet. Why wasn’t that encrypted? Perhaps the WCCP-invoked proxy server was desperately trying to help the PC recover from an unrecoverable situation and manufactured that HTTP-EQUIV REFRESH… to try to force the PC to choose a web site that might work. The fact that it was sent unencrypted over a channel that should have been encrypted was probably even the death bell that triggered the browser to think this makes no sense at all and is even a violation of security, I’m getting out of here.