They say when it rains it pours. As a harassed It support specialist its tempting to lump all similar-looking problems that come across your desk at the same time in the same bucket. This case shows the shallowness of that way of thinking, for that’s exactly what happened in this case. It occurred exactly when this case was occurring in which a user had an issue displaying a secure web site. That other case was described here.
I was doing some work for a company when they came to me with a problem one user at HQ was having accessing an https web site. Everything else worked fine. The same web page worked fine from other sites.
Since I had helped them set up their proxy services I was keen to prove that the proxy wasn’t at fault. The user removed the proxy settings – still the error occurred. The desktop support got involved. I suggested a whole battery of tests because this was a weird one.
– what if you access via proxy?
– what if you take that laptop and access it from a VPN connection?
– what if you access another site which uses the same certificate-issuer?
– what if you access the host, but using http?
– can you PING the server?
– can you telnet to that server on port 443?
– what if you access the webpage by IP address?
– what if accessed via Firefox?
The error, by the way, was
Internet Explorer cannot display the webpage
What you can try:
Diagnose Connection Problems
The answers came back like this:
– what if you access via proxy? It works!
– what if you take that laptop and access it from a VPN connection? It works.
– what if you access another site which uses the same certificate-issuer? It works.
– what if you access the host, but using http? It works.
– can you PING the server? Yes.
– can you telnet to that server on port 443? Yes
– what if you access the webpage by IP address? It does not work.
– Firefox? I don’t remember the answer to this.
Most of that thinking behind those tests is pretty obvious. Why so many tests? The networking group is kind of crotchety and understaffed, so they really wanted to eliminate all other possibilities first. And at one point I thought it could have been a desktop issue. Why would the network allow some of these packets through but not others when it doesn’t run a firewall? The desktop did have A-V and local firewall, after all.
So I didn’t have proof or even a good idea. Time to take out the big guns. We ran a trace on the server with tcpdump while the error occurred. The server was busy so we had to be pretty specific:
> tcpdump -s 1580 -w /tmp/drjcap.cap -i any host 10.19.79.216 and port 443
What do we see? We see the initial handshake go through just fine. Then a client Hello, then a server Hello, followed by a Certificate sent from the server to the client. The Certificate packet kept getting re-sent because there was no ACK to it from the client. So it’s beginning to smell like a dropped packet somewhere. I was hot on the trail but I decided I needed even stronger proof. We arranged to do simultaneous traces on both PC and server to compare the two results.
On the PC we had to install Wireshark. Actually I think Microsoft also has a utility to do traces but I’ve never used it. What did we find? Proof that there was indeed a dropped packet. That server certificate? Never received by the PC (which is acting as the SSL client). But why?
I had noticed one other funny thing about the trace comparisons. The server hello packet left as a packet of length 1518 bytes, but arrived, apparently, as two packets, one of length 778 bytes, the other of 770 bytes, i.e., it was fragmented. That should be OK. I don’t think the don’t fragment flag was set (have to check this). But it got me to wondering. Because the server certificate packet was also on the large side – 1460 bytes.
Regardless, I had enough ammo to go to the networking group with, which I did. It was like, “Oh yeah, our Telecom provider (let’s call it “CU”) implemented GETVPN around that time.” And further discussion revealed suspected other problems related to this change with MTU, etc. In fact they dredged up this nice description of the pitfalls that await GET VPN implementations from the Cisco deployment guide:
3.8 Designing Around MTU Issues Because of additional IPsec overhead added to each packet, MTU related issues are very common in IPsec deployments, and MTU size becomes a very important design consideration. If MTU value is not carefully selected by either predefining the MTU value on the end hosts or by dynamically setting it using PMTU discovery, the network performance will be impacted because of fragmentation and reassembly. In the worst case, the user applications will not work because network devices might not be able to handle the large packets and are unable to fragment them because of the df-bit setting. Some of the scenarios which can adversely affect traffic in a GET VPN environment and applicable mitigation techniques are discussed below. LAN MTU of 1500 – WAN MTU 44xx (MPLS) In this scenario, even after adding the 50-60 byte overhead, MTU size is much less than the MTU of the WAN. The MTU does not affect GET VPN traffic in any shape or form. LAN MTU of 1500 – WAN MTU 1500 In this scenario, when IPsec overhead is added to the maximum packet size the LAN can handle (i.e. 1500 bytes) the resulting packet size becomes greater than the MTU of the WAN. The following techniques could help reduce the MTU size to a value that the WAN infrastructure can actually handle. Manually setting a lower MTU on the hosts By manually setting the host MTU to 1400 bytes, IP packets coming in on the LAN segment will always have 100 extra bytes for encryption overhead. This is the easiest solution to the MTU issues but is harder to deploy because the MTU needs to be tweaked on all the hosts. TCP Traffic Configure ip tcp adjst-mss 1360 on GM LAN interface. This command will ensure that resulting IP packet on the LAN segment is less than 1400 bytes thereby providing 100 bytes for any overhead. If the maximum MTU is lowered by other links in the core (e.g. some other type of tunneling such as GRE is used in the core), the adjust-mss value can be lowered further. This value only affects TCP traffic and has no bearing on the UDP traffic. 126.96.36.199 Host compliant with PMTU discovery For non-TCP traffic, for a 1500 packet with DF bit set, the GM drops the packet and send ICMP message back to sender notifying it to adjust the MTU. If sender and the application is PMTU compliant, this will result in a packet size which can successfully be handled by WAN. For example, if a GM receives a 1500 byte IP packet with the df-bit set and encryption overhead is 60 bytes, GM will notify the sender to reduce the MTU size to 1440 bytes. Sender will comply with the request and the resulting WAN packets will be exactly 1500 bytes
I don’t claim to understand all those scenarios, but it shouts pretty loudly. Watch out for problems with packets of size 1400 bytes or larger!
Why would they want GET VPN? To encrypt the HQ communication over the WAN. So the idea is laudable, but the execution lacking.
Case: understood but not yet closed! The fix is not yet in…
To be continued…