Admin DNS IT Operational Excellence

The IT Detective Agency: since when can a powered off PC do dynamic DNS updates?

The IT Detectives are back after a short lull during which no great mysteries needed expert resolution – you knew that situation couldn’t last too long. The following tale was relayed to me, I unfortunately cannot claim to have been any help whatsoever. The details have been somewhat obscured in this retelling.

The details
One of our DNS servers at drjohns was busy fielding lots and lots of DDNS updates. Good, right? No, not so. Because our employee PCs are all configured to not do this very thing. In Windows 7 drilling down into the advanced DNS settings you have a Checkbox for Register this connection’s addresses in DNS. And that is unchecked. So although we use DHCP, the PCs shouldn’t be sending their DDNS updates. Yet they were. In fact at one point a considerable amount of bandwidth was being eaten up with these unwanted updates, so we had to investigate and act. But where to begin?

Word finally got around to one of our PC experts who I guess probably had his suspicions. He suggested the following test:

turn the PC off and look for DDNS updates on the DNS server

Amazingly, that’s exactly what we found to be the case – DDNS updates coming from a powered off PC. The DDNS updates did not always go to the same DNS server. The chosen DNS server seemed randomly chosen, but they all were drjohns DNS servers.

A Wireshark examination of a trace (taken by a network engineer) showed lots of Dynamic Update SOA I looked at the trace and found that that was just a title given by Wireshark for what was happening, and not a very accurate one. If you expand the packet you saw inside of it that (mostly) it was a workstation trying to register its A record on the DNS server (a DDNS update). It wasn’t literally trying to change the SOA record for the zone though that might have been the logical result of updating its A record.

What the power-off test showed to our subject-area expert is that Intel vPro was responsible for these DDNS updates. Wait, you ask, what the heck is vPro? We didn’t know either. As I understand it, it’s an additional Intel chip that some business-class laptops (e.g., DELL Latitude) might include that permits more and better remote management, allowing perhaps even some hardware diagnostics to occur.

So let’s go back to that test. Note that I said PC powered off, I did not say disconnected from the network! Powered-off-but-network-connected produces the DDNS update, powered-off-and-disconnected – no update, of course (Hey, it’s not magic going on here!).

So the solution, obvisouly, is to turn off DDNS in vPro. We thought it was off, but maybe not. We expect and hope this to the solution, but a few more days will be needed before this all plays out and we know for sure.

I better hold off on any conclusion until our premise is confirmed! But one feeling I have is that sometimes you have to ingratiate yourself to the right people because no one person has all the answers!

Admin Network Technologies

The IT Detective Agency: Internet Explorer cannot display https web page, part II

They say when it rains it pours. As a harassed It support specialist its tempting to lump all similar-looking problems that come across your desk at the same time in the same bucket. This case shows the shallowness of that way of thinking, for that’s exactly what happened in this case. It occurred exactly when this case was occurring in which a user had an issue displaying a secure web site. That other case was described here.

The details
I was doing some work for a company when they came to me with a problem one user at HQ was having accessing an https web site. Everything else worked fine. The same web page worked fine from other sites.

Since I had helped them set up their proxy services I was keen to prove that the proxy wasn’t at fault. The user removed the proxy settings – still the error occurred. The desktop support got involved. I suggested a whole battery of tests because this was a weird one.
– what if you access via proxy?
– what if you take that laptop and access it from a VPN connection?
– what if you access another site which uses the same certificate-issuer?
– what if you access the host, but using http?
– can you PING the server?
– can you telnet to that server on port 443?
– what if you access the webpage by IP address?
– what if accessed via Firefox?
– etc.
The error, by the way, was

Internet Explorer cannot display the webpage

What you can try:
Diagnose Connection Problems

The answers came back like this:
– what if you access via proxy? It works!
– what if you take that laptop and access it from a VPN connection? It works.
– what if you access another site which uses the same certificate-issuer? It works.
– what if you access the host, but using http? It works.
– can you PING the server? Yes.
– can you telnet to that server on port 443? Yes
– what if you access the webpage by IP address? It does not work.
– Firefox? I don’t remember the answer to this.

Most of that thinking behind those tests is pretty obvious. Why so many tests? The networking group is kind of crotchety and understaffed, so they really wanted to eliminate all other possibilities first. And at one point I thought it could have been a desktop issue. Why would the network allow some of these packets through but not others when it doesn’t run a firewall? The desktop did have A-V and local firewall, after all.

So I didn’t have proof or even a good idea. Time to take out the big guns. We ran a trace on the server with tcpdump while the error occurred. The server was busy so we had to be pretty specific:

> tcpdump -s 1580 -w /tmp/drjcap.cap -i any host and port 443

What do we see? We see the initial handshake go through just fine. Then a client Hello, then a server Hello, followed by a Certificate sent from the server to the client. The Certificate packet kept getting re-sent because there was no ACK to it from the client. So it’s beginning to smell like a dropped packet somewhere. I was hot on the trail but I decided I needed even stronger proof. We arranged to do simultaneous traces on both PC and server to compare the two results.

On the PC we had to install Wireshark. Actually I think Microsoft also has a utility to do traces but I’ve never used it. What did we find? Proof that there was indeed a dropped packet. That server certificate? Never received by the PC (which is acting as the SSL client). But why?

I had noticed one other funny thing about the trace comparisons. The server hello packet left as a packet of length 1518 bytes, but arrived, apparently, as two packets, one of length 778 bytes, the other of 770 bytes, i.e., it was fragmented. That should be OK. I don’t think the don’t fragment flag was set (have to check this). But it got me to wondering. Because the server certificate packet was also on the large side – 1460 bytes.

Regardless, I had enough ammo to go to the networking group with, which I did. It was like, “Oh yeah, our Telecom provider (let’s call it “CU”) implemented GETVPN around that time.” And further discussion revealed suspected other problems related to this change with MTU, etc. In fact they dredged up this nice description of the pitfalls that await GET VPN implementations from the Cisco deployment guide:

3.8 Designing Around MTU Issues Because of additional IPsec overhead added to each packet, MTU related issues are very common in IPsec deployments, and MTU size becomes a very important design consideration. If MTU value is not carefully selected by either predefining the MTU value on the end hosts or by dynamically setting it using PMTU discovery, the network performance will be impacted because of fragmentation and reassembly. In the worst case, the user applications will not work because network devices might not be able to handle the large packets and are unable to fragment them because of the df-bit setting. Some of the scenarios which can adversely affect traffic in a GET VPN environment and applicable mitigation techniques are discussed below. LAN MTU of 1500 – WAN MTU 44xx (MPLS) In this scenario, even after adding the 50-60 byte overhead, MTU size is much less than the MTU of the WAN. The MTU does not affect GET VPN traffic in any shape or form. LAN MTU of 1500 – WAN MTU 1500 In this scenario, when IPsec overhead is added to the maximum packet size the LAN can handle (i.e. 1500 bytes) the resulting packet size becomes greater than the MTU of the WAN. The following techniques could help reduce the MTU size to a value that the WAN infrastructure can actually handle. Manually setting a lower MTU on the hosts By manually setting the host MTU to 1400 bytes, IP packets coming in on the LAN segment will always have 100 extra bytes for encryption overhead. This is the easiest solution to the MTU issues but is harder to deploy because the MTU needs to be tweaked on all the hosts. TCP Traffic Configure ip tcp adjst-mss 1360 on GM LAN interface. This command will ensure that resulting IP packet on the LAN segment is less than 1400 bytes thereby providing 100 bytes for any overhead. If the maximum MTU is lowered by other links in the core (e.g. some other type of tunneling such as GRE is used in the core), the adjust-mss value can be lowered further. This value only affects TCP traffic and has no bearing on the UDP traffic. Host compliant with PMTU discovery For non-TCP traffic, for a 1500 packet with DF bit set, the GM drops the packet and send ICMP message back to sender notifying it to adjust the MTU. If sender and the application is PMTU compliant, this will result in a packet size which can successfully be handled by WAN. For example, if a GM receives a 1500 byte IP packet with the df-bit set and encryption overhead is 60 bytes, GM will notify the sender to reduce the MTU size to 1440 bytes. Sender will comply with the request and the resulting WAN packets will be exactly 1500 bytes

I don’t claim to understand all those scenarios, but it shouts pretty loudly. Watch out for problems with packets of size 1400 bytes or larger!

Why would they want GET VPN? To encrypt the HQ communication over the WAN. So the idea is laudable, but the execution lacking.

Case: understood but not yet closed! The fix is not yet in…

To be continued…

Admin IT Operational Excellence Linux Proxy Web Site Technologies

The IT Detective Agency: intermittent web page not found error

One of the high arts of IT is system integration, and an important off-shoot of this is acquisitions. We are involved in integrating a new location, which, unfortunately, we do not yet have full access to. The local networking is still provided by their vendor, not ours and this makes troubleshooting all the more difficult.

The Details
So the word begins to spread that users at this site are having intermittent problems accessing some of our secure web sites. As it was described to me, they can load the page in their browser for, say five straight times, get a simple Internet Explorer cannot display the web page error, and the sixth time (or whenever) it will load properly. All other connectivity was working. No one else at other locations was having this problem with this web site. More than strange, right?

In drjohn’s perfect IT world, problem reproducibility is critical to resolution, but we simply didn’t have it this time. I also could not produce the problem myself, which means relying on other people.

I’m not sure if we tried to contact their vendor or not at first. But if we had I’m sure they would have denied having anything to do with it.

So we got one of our confederates, Tim, over to this location and we hooked him up with Wireshark so he could get take a packet trace when the failure occurs. It wasn’t long before Tim reproduced the error and emailed us the packet capture.

In the following the PC has IP address, the web server is at The Linux command used to look at the capture file is:

# tcpdump -A -r bodega-error.cap port 443 > /tmp/dump

1 15:54:27.495952 IP > S 2803722614:2803722614(0) win 64240 <mss 1460,nop,wscale 0,nop,nop,sackOK>
2 15:54:27.496309 IP > S 3201081612:3201081612(0) ack 2803722615 win 5840 <mss 1432,nop,nop,sackOK>
3 15:54:27.496343 IP > . ack 1 win 64240
4 15:54:27.497270 IP > P 1:82(81) ack 1 win 64240
5 15:54:27.497552 IP > . ack 82 win 5840
6 15:54:30.743827 IP > P 1:286(285) ack 82 win 5840
..S.......^M..i.P.......HTTP/1.0 200 OK^M
Cache-Control: no-store^M
Pragma: no-cache^M
Cache-Control: no-cache^M
X-Bypass-Cache: Application and Content Networking System Software 5.5.17^M
Connection: Close^M
7 15:54:30.744036 IP > F 82:82(0) ack 286 win 63955
8 15:54:30.744052 IP > F 286:286(0) ack 82 win 5840
9 15:54:30.744077 IP > . ack 287 win 63955
10 15:54:30.744289 IP > . ack 83 win 5840

The output was scrubbed a bit of meaningless junk characters and I added serial packets numbers in the beginning by hand because I don’t (yet) know how to do that with tcpdump!

What, It’s Encrypted – what can you even learn from a trace?
Yeah, an SSL stream sure adds to the already steep challenges we faced in this problem. There just isn’t much to work with. But it is something. I’m about to say what I noticed in this packet trace, but for it to be meaningful you need to know like I did that the web server is situated almost four thousand miles from the user’s location.

The first packet is a SYN from the PC to web server on TCP port 443. So far so good. In fact packets one – three constitute the three-way handshake in TCP.

Although SSL is encrypted, the beginning of the protocol communication should show the SSL cipher being chosen. Unfortunately, tcpdump doesn’t seem to have the smarts to show any of this. So I got myself ssldump. On Ubuntu:

# sudo apt-get install ssldump

did the trick. Then run this same capture file through ssldump, which has very similar arguments to tcpdump:

# ssldump -r bodega-error.cap port 443

New TCP connection #1: <->
1 1  0.0013 (0.0013)  C>S SSLv2 compatible client hello
  Version 3.1
  cipher suites
  Unknown value 0xff
Unknown SSL content type 72
1    3.2480 (3.2467)  C>S  TCP FIN
1 2  3.2481 (0.0000)  S>CShort record
1    3.2481 (0.0000)  S>C  TCP FIN

The way to interpret this is that 0.0013 s into the TCP port 443 communication the cipher suites listed above were sent by the PC to the server. This corresponds to our packet number 4 in the trace file.

Using Wireshark to look at the trace is a lot more convenient – it provides packet numbers, timing, decodes packets and displays the SSL ciphers. But I wanted to show that it _could_ be done with text-based tools.

Look at the timings more closely. In the tcpdump output, packet 2, the SYN ACK, comes 1 ms after the SYN. But given the distances involved between PC and server, the SYN ACK should have come more like 100 ms later, at least. Similarly packet 5, which is an ACK, comes less than 1 ms after packet 4. A 1 ms ACK? Physically impossible.

I have seen this behaviour before – on our own load balance – which I know employs some TCP optimization tricks. So I concluded that they must have physically present at this site some kind of appliance which is doing TCP optimization. It can only provide blank ACKs in its rapid-fire responses since it can’t know what data the server is really going to respond with. That might all be OK. But I’m pretty sure the problem lies between packets 5 and 6. 5 is one of those meaningless rapid-fire empty ACKs generated by the local router. But the PC has just sent a wish list of SSL ciphers in packet 4. It needs to be responded to by the server which has to finish setting up the SSL session.

But that critical packet from the server never arrives. Perhaps even some of the SSL handshake is secretly completed between the local router and the server. Who knows? I have heard of man-in-the-middle devices that decrypt SSL sessions. And packet 6 contains fairly inappropriate content. It almost does look like it has been manufactured by a man-in-the-middle device. Its telling the browser to do a redirect to the same site, except specified by IP address rather than FQDN. And that doesn’t make a lot of sense. The browser likely realizes that this amounts to a looping redirect request so at that point it probably decides to cut its losses and FIN the connection in packet 7.

I traced my own PC hitting this same web server. Now I know we don’t have any of these optimizing devices between me and the web server. I don’t have time to show the results here, but to summarize, it looks rather completely different from the trace above. The ACK packets come back in about 100 ms or so. There is no delay of three seconds. The cipher proposals are responded to in a timely fashion. There is no redirect.

Their Side of the Story
We did get to hear back from the vendor who supports the LAN/WAN. They said they were running WCCP and diverting traffic to a proxy server. This was the correct behaviour before we hooked our infrastructure to this site, but is no longer. They realized this was probably a bad thing and took corrective action to turn off WCCP for destinations in the internal network

Shutting off WCCP, which diverted web site requests to an old proxy server, fixed the problem.

Case closed.

Unsolved Mysteries
I wish we could tie all the loose ends neatly up, but there are too many players involved. We’ll never really know why the problem was intermittent, for instance. Or why some secure web sites could be accessed without any issue whatsoever throughout this ordeal.

WCCP, Web cache Communication Protocol, is a Cisco-developed routing protocol to transparently intercept traffic destined for web servers. More information can be found on it in wikipedia.

It bothers me that after the SSL session was initiated the dump showed the source, unencrypted, of the HTML redirect packet. Why wasn’t that encrypted? Perhaps the WCCP-invoked proxy server was desperately trying to help the PC recover from an unrecoverable situation and manufactured that HTTP-EQUIV REFRESH… to try to force the PC to choose a web site that might work. The fact that it was sent unencrypted over a channel that should have been encrypted was probably even the death bell that triggered the browser to think this makes no sense at all and is even a violation of security, I’m getting out of here.