Categories
Admin Proxy

Google Hangouts Meet – what do these IPs all have in common?

2020 update
209.85.144.127
74.125.250.71

Older IPs
173.194.207.127
209.85.232.127
64.233.177.127
64.233.177.103
74.125.196.127

They all have been used by Google’s Hangouts Meet based on my observation.
If you have an environment which uses proxy authentication, the above IPs do not play well with that. So you’ll need to disable proxy authentication for them for Hangouts Meet to work.

Otherwise you can do the initial connect but will be dropped after about 45 seconds.

Finally, although you can look up each individually and learn of its association to Google, Google’s own documentation is devoid of any references to them. That is unfortunate.

So the IPs in actual use is probably much larger, but these are what I’ve observed over the course of a few days of testing.

What I’ve done
Google is very hard to reach. They only provide indirect means for regular users. So not knowing any insiders I submitted feedback at https://meet.google.com/ which is what they suggest. I provided a detailed description. I doubt they will do anything with my feedback. We’ll see.

Conclusion
Google has an undocumented dependency on a whole set of IPs for hangout Meets to actually work through a proxy which requires authentication. Contacting Google for more information is probably impossible, but I will try.

Categories
Admin Network Technologies Proxy Security TCP/IP Web Site Technologies

The IT Detective Agency: Cisco Jabber stopped working for some using WAN connections

Intro
This is probably the hardest case I’ve ever encountered. It’s so complicated many people needed to get involved to contribute to the solution.

Initial symptoms

It’s not easy to describe the problem while providing appropriate obfuscation. Over the course of a few days it came to light that in this particular large company for which I consult many people in office locations connected via an MPLS network were no longer able to log in to Cisco Jabber. That’s Cisco’s offering for Instant Messaging. When it works and used in combination with Cisco IP phones it’s pretty good – has some nice features. This major problem was first reported November 17th.

Knee-jerk reactions
Networking problem? No. Network guys say their networks are running fine. They may be a tad overloaded but they are planning to route Internet over the secondary links so all will be good in a few days.
Proxy problem? Nope. proxy guys say their Bluecoat appliances are running fine and besides everyone else is working.
Application problem? Application owner doesn’t see anything out of the ordinary.
Desktop problem? Maybe but it’s unclear.

Methodology
So of the 50+ users affected I recognized two power users that I knew personally and focussed on them. Over the course of days I learned:
– problem only occurs for WAN (MPLS) users
– problem only occurs when using one particular proxy
– if a user tries to connect often enough, they may eventually get in
– users can get in if they use their VPN client
– users at HQ were not affected

The application owner helpfully pointed out the URL for the web-based version of Cisco Jabber: https://loginp.webexconnect.com/… Anyone with the problem also could not log in to this site.

So working with these power users who patiently put up with many test suggestions we learned:

– setting the PC’s MTU to a small value, say 512 up to 696 made it work. Higher than that it generally failed.
– yet pings of up to 1500 bytes went through OK.
– the trace from one guy’s PC showed all his packets re-transmitted. We still don’t understand that.
– It’s a mess of communications to try to understand these modern, encrypted applications
– even the simplest trace contained over 1000 lines which is tough when you don’t know what you’re looking for!
– the helpful networking guy from the telecom company – let’s call him “Regal” – worked with us but all the while declaring how it’s impossible that it’s a networking issue
– proxy logs didn’t show any particular problem, but then again they cannot look into SSL communication since it is encrypted
– disabling Kaspersky helped some people but not others
– a PC with the problem had no problem when put onto the Internet directly
– if one proxy associated with the problem forwarded the requests to another, then it begins to work
– Is the problem reproducible? Yes, about 99% of the time.
– Do other web sites work from this PC? Yes.

From previous posts you will know that at some point I will treat every problem as a potential networking problem and insist on a trace.

Biases going in
So my philosophy of problem solving which had stood the test of time is either it’s a networking problem, or it’s a problem on the PC. Best is if there’s a competition of ideas in debugging so that the PC/application people seek to prove beyond a doubt it is a networking problem and the networking people likewise try to prove problem occurs on the PC. Only later did I realize the bias in this approach and that a third possibility existed.

So I enthused: what we need is a non-company PC – preferably on the same hardware – at the same IP address to see if there’s a problem. Well we couldn’t quite produce that but one power user suggested using a VM. He just happened to have a VM environment on his PC and could spin up a Windows 7 Professional generic image! So we do that – it shows the problem. But at least the trace form it is a lot cleaner without all the overhead of the company packages’ communication.

The hard work
So we do the heavy lifting and take a trace on both his VM with the problem and the proxy server and sit down to compare the two. My hope was to find a dropped packet, blame the network and let those guys figure it out. And I found it. After the client hello (this is a part of the initial SSL protocol) the server responds with its server hello. That packet – a largeish packet of 1414 bytes – was not coming through to the client! It gets re-transmitted multiple times and none of the re-transmits gets through to the PC. Instead the PC receives a packet the proxy never sent it which indicates a fatal SSL error has occurred.

So I tell Regal that look there’s a problem with these packets. Meanwhile Regal has just gotten a new PC and doesn’t even have Wireshark. Can you imagine such a world? It seems all he really has is his tongue and the ability to read a few emails. And he’s not convinced! He reasons after all that the network has no intelligent, application-level devices and certainly wouldn’t single out Jabber communication to be dropped while keeping everything else. I am no desktop expert so I admit that maybe some application on the PC could have done this to the packets, in effect admitting that packets could be intercepted and altered by the PC even before being recorded by Wireshark. After all I repeated this mantra many times throughout:

This explanation xyz is unlikely, and in fact any explanation we can conceive of is unlikely, yet one of them will prove to be correct in the end.

Meanwhile the problem wasn’t going away so I kludged their proxy PAC file to send everyone using jabber to the one proxy where it worked for all.

So what we really needed was to create a span port on the switch where the PC was plugged in and connect a 2nd PC to a port enabled in promiscuous mode with that mirrored traffic. That’s quite a lot of setup and we were almost there when our power user began to work so we couldn’t reproduce the problem. That was about Dec 1st. Then our 2nd power user also fell through and could no longer reproduce the problem either a day later.

10,000 foot view
What we had so far is a whole bunch of contradictory evidence. Network? Desktop? We still could not say due to the contradictions, the likes of which I’ve never witnessed.

Affiliates affected and find the problem
Meanwhile an affiliate began to see the problem and independently examined it. They made much faster progress than we did. Within a day they found the reason (suggested by their networking person from the telecom, who apparently is much better than ours): the server hello packet has the expedited forwarding (EF) flag set in the differentiated code services point (DSCP) section of the IP header.

Say what?
So I really got schooled on this one. I was saying It has to be an application-aware “something” on the network or PC that is purposefully messing around with the SSL communication. That’s what the evidence screamed to me. So a PC-based firewall seemed a strong contender and that is how Regal was thinking.

So the affiliate explained it this way: the company uses QOS on their routers. Phone (VOIP) gets priority and is the only application where the EF bit is expected to be set. VOIP packets are small, by the way. Regular applications like web sites should just use the default QOS. And according to Wikipedia, many organizations who do use QOS will impose thresholds on the EF pakcets such that if the traffic exceeds say 30% of link capacity drop all packets with EF set that are over a certain size. OK, maybe it doesn’t say that, but that is what I’ve come to understand happens. Which makes the dropping of these particular packets the correct behaviour as per the company’s own WAN contract and design. Imagine that!

Smoking gun no more
So now my smoking gun – blame it on the network for dropped packets – is turned on its head. Cisco has set this EF bit on its server hello response on the loginp.webexconnect.com web site. This is undesirable behaviour. It’s not a phone call after all which requires a minimum jitter in packet timing.

So next time I did a trace I found that instead of EF flag being set, the AF (Assured Forwarding) flag was set. I suppose that will make handling more forgiving inside the company’s network, but I was told that even that was too much. Only default value of 0 should be set for the DSCP value. This is an open issue in Cisco’s hands now.

But at least this explains most observations. Small MTU worked? Yup, those packets are looked upon more favorably by the routers. One proxy worked, the other did not? Yup, they are in different data centers which have different bandwidth utilization. The one where it was not working has higher utilization. Only affected users are at WAN sites? Yup, probably only the WAN routers are enforcing QOS. Worked over VPN, even on a PC showing the problem? Yup – all VPN users use a LAN connection for their proxy settings. Fabricated SSL fatal error packet? I’m still not sure about that one – guess the router sent it as a courtesy after it decided to drop the server hello – just a guess. Problem fixed by shutting down Kaspersky? Nope, guess that was a red herring. Every problem has dead ends and red herrings, just a fact of life. And anyway that behaviour was not very consistent. Problem started November 17th? Yup, the affiliate just happened to have a baseline packet trace from November 2nd which showed that DSCP was not in use at that time. So Cisco definitely changed the behaviour of Cisco Jabber sometime in the intervening weeks. Other web sites worked, except this one? Yup, other web sites do not use the DSCP section of the IP header so it has the default value of 0.

Conclusion
Cisco has decided to remove the DSCP flag from these packets, which will fix everything. Perhaps EF was introduced in support of Cisco Jabber’s extended use as a soft phone??? Then this company may have some re-design of their QOS to take care of because I don’t see an easy solution. Dropping the MTU on the proxy to 512 seems pretty drastic and inefficient, though it would be possible. My reading of TCP is that nothing prevents QOS from being set on any sort of TCP packet even though there may be a gentleman’s agreement to not ordinarily do so in all except VOIP packets or a few other special classes. I don’t know. I’ve really never looked at QOS before this problem came along.

The company is wisely looking for a way to set all packets with DSCP = 0 on the Intranet, except of course those like VOIP where it is explicitly supposed to be used. This will be done on the Internet router. In Cisco IOS it is possible with a policy map and police setting where you can set set-dscp-transmit default. Apparently VPN and other things that may check the integrity of packets won’t mind the DSCP value being altered – it can happen anywhere along the route of the packet.

Boy applications these days are complicated! And those rare times when they go wrong really require a bunch of cooperating experts to figure things out. No one person holds all the expertise any longer.

My simplistic paradigm of its either the PC or the network had to make room for a new reality: it’s the web site in the cloud that did them in.

Could other web sites be similarly affected? Yes it certainly seems a possibility. So I now know to check for use of DSCP if a particular web site is not working, but all others are.

References and related
This Wikipedia article is a good description of DSCP: https://en.wikipedia.org/wiki/Differentiated_services

Categories
Network Technologies Proxy

The IT Detective Agency: The case of the purloined packet

Intro
Someone in the org notices that laptops can’t connect to our VPN concentrator when using Verizon MiFi. An investigation ensues and culprits are found. Read on for the details…

The details

Some background
The laptops are configured to use an explicit PAC file (proxy auto-config) for proxy access to the Internet. It works great when they’re on the Intranet. There were some initial bumps in the road, so we found the best results (as determined by using normal ISPs such as CenturyLink) for successful sign-on to VPN and establishing the tunnel was to create a dummy service on the Internet for the PAC file server. The main thing that does is send a RESET packet every time someone requests a PAC file. In that way the laptop knows that it should quickly give up on retrieving the PAC file and just use direct Internet connections for its initial connection to the VPN concentrator. OK?

So with that background, this current situation will make more sense. sometime after we developed that approach someone from the organization came along and tried to connect to VPN using a Verizon MiFi device. Although in the JunOS client it shows Connected in actual fact no tunnel is ever established so it does not work.

Time for to try some of our methodical testing to get to the bottom of this. Fortunately my friend has a Verizon MiFi jetpack, so I hooked myself up. Yup, same problem. It shows me connected but I haven’t been issued a tunnel IP (as ipconfig shows) and I am not on the internal network.

Next test – no PAC file
Then I turned off use of a PAC file and tried again. I connected and brought up a tunnel just fine!

Next test – PAC file restored, use of normal ISP

Then I restored the PAC file setting and tried my regular ISP, CenturyLink. I connected just fine!

Next test – try with a Verizon Hotspot

I tried my Verizon phone’s Hotspot. It also did not work!

I performed these tests several times to make sure there weren’t any flukes.

I also took traces of all these tests. To save myself time I won’t share the traces here like I normally do, but describe the relevant bits that I feel go into the heart of this case.

The heart of the matter
When the laptop is bringing up the VPN tunnel there is always at least one and usually several requests for the PAC file (assuming we’re in a mode where PAC file is in use). Remember I said above that the PAC file on the Internet is “served” by a dummy server that all it does is respond to every request (at TCP level every SYN) with a RST (reset)?

Well, that’s exactly what I saw on the laptop trace when using CenturyLink. But is not what I saw when using either Verizon MiFi or HotSpot. No, in the Verizon case the laptop is sent a (SYN,ACK). So the laptop in that case finishes the TCP handshake with an ACK and then makes a full HTTP request for the PAC file (which of course it never receives).

Well, that is just wrong because now the laptop thinks there’s a chance that it might get the PAC file if it just waits around long enough, or something like that.

The main point is that Verizon – probably with the best intentions – has messed with our TCP packets and the laptop’s behavior is so different as a result that it breaks this application.

I’m still thinking about what to do. I doubt Verizon, being a massive organization, is likely to change their ways, but we’ll see. UPDATE. After a few weeks and poor customer support from Verizon, they want to try some more tests. Verizon refuses to engage in a discussion at a technical level with us and ignores all our technical points in this open case. They don’t agree or disagree, but simply ignore it. Then they cherry pick some facts which support their preconceived notion of the conclusion. “It work with the PAC file turned off so the only change is on your end” is pretty much an exact quote from their “support.”

We also asked Verizon for all their ranges so we could respond differently to Verizon users, but they also ignored that request.

In particular What Verizon is doing amounts to WAN optimization.

UPDATE – a test with Sprint
I got a chance to try yet a different Telecom ISP: Sprint. I expected it to work because we don’t have any complaints, and indeed it did. I tested with a Sprint aircard someone lent me. But the surprising thing – indeed amazing thing – is how it worked. Sprint also clobbers those RST packets from when the laptop if fetching the PAC file. They also turn them into SYN,ACKs. But they carry the deceit one step further. When the laptop then does the GET request, they go so far as to fake the server response! They respond with a short 503 server fail status code! The laptop makes a few attempts to get the PAC file, always getting these 503 responses and then it gives up and it is happy to establish the VPN tunnelled connection at that point! So this has got me thinking…

April, 2014 update
Verizon actually did continue to work with us. They asked us the IP of our external PAC file server, never revealing what they were going to do with that information. Then we provided them phone numbers of Verizon devices we wished to test with. As an aside did you know that a Verizon MiFi Jetpack has a phone number? It does. Just log into it at my.jetpack and it will be displayed.

So I tried the Jetpack after they did their secret things, and, yes, VPN now works. But we are not helpless. I also traced that session, and in particular the packets to and from the PAC file webserver. Yes, they are coming back to my laptop as RSTs. I tested a second time, because once can be a fluke. Same thing: successful connection and RST packets as we wanted them to be.

Now we just need the fix to be generalized to all Verizon devices.

May 2014 update
Finally, finally, on May 6th we got the word that the fix has been rolled out. They are “bypassing” the IPs of our PAC file webservers, which means we get to see the RST packets. I tested an aircard and a Verizon hotspot and all looked good. VPN was established and RST packets were observed.

Case finally closed.

Conclusion
Verizon has done some “optimizations” which clobber our RST packets. It is probably pre-answering SYNs, with ACKs, which under ordinary circumstances probably helps performance. But it is fatal to creating an SSL VPN tunnel with the JunOS client configured with a PAC file. Sprint optimizes as well, but in such a way that things actually work. Verizon actually worked with us, at their own slow pace and secretive methodology, and changed how they handled those packets and things began to work correctly.

References
The next two links are closely related to this current issue.
The IT Detective Agency: How We Neutralized Nasty DNS Clobbering Before it Could Bite Us.
The IT Detective Agency: Browsing Stopped Working on Internet-Connected Enterprise Laptops

Categories
Admin DNS Proxy

The IT Detective Agency: Browsing Stopped Working on Internet-Connected Enterprise Laptops

Intro
Recall our elegant solution to DNS clobbering and use of a PAC file, which we documented here: The IT Detective Agency: How We Neutralized Nasty DNS Clobbering Before it Could Bite Us. Things were going smoothly with that kludge in place for months. Then suddenly it wasn’t. What went wrong and how to fix?? Read on.

The Details
We began hearing reports of increasing numbers of users not being able to access Internet sites on their enterprise laptops when directly connected to Internet. Internet Explorer gave that cryptic error to the effect Internet Explorer cannot display the web page. This was a real problem, and not a mere inconvenience, for those cases where a user was using a hotel Internet system that required a web page sign-on – that web page itself could not be displayed, giving that same error message. For simple use at home this error may not have been fatal.

But we had this working. What the…?

I struggled as Rop demoed the problem for me with a user’s laptop. I couldn’t deny it because I saw it with my own eyes and saw that the configuration was correct. Correct as in how we wanted it to be, not correct as in working. Basically all web pages were timing out after 15 seconds or so and displaying cannot display the web page. The chief setting is the PAC file, which is used by this enterprise on their Intranet.

On the Internet (see above-mentioned link) the PAC file was more of an annoyance and we had aliased the DNS value to 127.0.0.1 to get around DNS clobbering by ISPs.

While I was talking about the problem to a colleague I thought to look for a web server on the affected laptop. Yup, there it was:

C:\Users\drj>netstat -an|more

Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:80             0.0.0.0:0              LISTENING
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING
  ...

What the…? Why is there a web server running on port 80? How will it respond to a PAC file request? I quickly got some hints by hitting my own laptop with curl:

$ curl -i 192.168.3.4/proxy.pac

HTTP/1.1 404 Not Found
Content-Type: text/html; charset=us-ascii
Server: Microsoft-HTTPAPI/2.0
Date: Tue, 17 Apr 2012 18:39:30 GMT
Connection: close
Content-Length: 315


Not Found

Not Found


HTTP Error 404. The requested resource is not found.

So the server is Microsoft-HTTPAPI, which upon invetsigation seems to be a Microsoft Web Deployment Agent Service (MsDepSvc).

The main point is that I don’t remember that being there in the past. I felt it’s presence, probably a new “feature” explained the current problem. What to do about it however??

Since this is an enterprise, not a small shop with a couple PCs, turning off MsDepSvc is not a realistic option. It’s probably used for peer-to-peer software distribution.

Hmm. Let’s review why we think our original solution worked in the first place. It didn’t work so well when the DNS was clobbered by my ISP. Why? I think because the ISP put up a web server when it encountered a NXDOMAIN DNS response and that web server gave a 404 not found error when the browser searched for the PAC file. Turning the DNS entry to the loopback interface, 127.0.0.1, gave the browser a valid IP, one that it would connect to and quickly receive a TCP RST (reset). Then it would happily conclude there was no way to reach the PAC file, not use it, and try DIRECT connections to Internet sites. That’s my theory and I’m sticking to it!

In light of all this information the following option emerged as most likely to succeed: set the Internet value of the DNS of the PAC file to a valid IP, and specifically, one that would send a TCP RST (essentially meaning that TCP port 80 is reachable but there is no listener on it).

We tried it and it seemed to work for me and my colleague. We no loner had the problem of not being able to connect to Internet sites with the PAC file configured and directly connected to Internet.

I noticed however that once I got on the enterprise network via VPN I wasn’t able to connect to Internet sites right away. After about five minutes I could.

My theory about that is that I had a too-long TTL on my PAC file DNS entry. The TTL was 30 minutes. I shortened it to five minutes. Because when you think about it, that DNS value should get cached by the PC and retained even after it transitions to Intranet-connected.

I haven’t retested, but I think that adjustment will help.

Conclusion
I also haven’t gotten a lot of feedback from this latest fix, but I’m feeling pretty good about it.

Case: again mostly solved.

Hey, don’t blame me about the ambiguity. I never hear back from the users when things are working 🙂

References
A closely related case involving Verizon “clobbering” TCP RST packets also bit us.

Categories
Admin IT Operational Excellence Proxy

The IT Detective Agency: the case of the Sales and Use tax software

Intro
I have to give credit to my colleague “Ben” for cracking this case, which left me scratching my head. Users at drjohns were getting new Windows 7 PCs and some of the old software wasn’t going to run on those new PCs, including our indirect tax sales and use software from Thomson Reuters. The new approach is SaaS – software as a service. The new package was approved and everyone thought it was going to work fine, until late in the game it was actually tested. They couldn’t bring up their old tax returns. So at the last hour they bring in the Internet experts.

The Details
At drjohns our users are insulated from the Internet by proxy servers. There are no direct routes. It’s private address space and an explicit proxy connection to browse out to Internet. 99% of the time this works fine. And it sure is a secure way to go. But those exceptions can be quite a headache. This case is a very typical presentation of what we see, though the particular solution varies case by case.

We get detailed network requirements. They usually talk about opening up the firewall to certain servers, etc. We always patiently explain that the firewall is open – to the proxy! The desktops have no Internet routes, nor can they resolve Internet domain names. That’s right we have private root DNS servers. Most vendors have never encountered this setup and so they dig in their heels and insist that the only way is to ‘open the firewall…”

This case was no different, except we didn’t actually talk to the vendor. But their requirements were crystal clear in this networking document. Here’s the snippet that would seem to be fatal given our Intranet architecture:

RS APPLICATION SERVERS
The ONESOURCE Sales & Use Application Servers use TCP/IP communications from the client
PC to the Server. The requirements for communications with the ONESOURCE Application
Servers are itemized below:
- DNS Name Resolution is not used for the Application Servers.
- Proxy Server access to the Application Servers is supported ONLY in transparent mode. The
Proxy Server must not translate the TCP/IP address of the Application Servers. PCs must be
able to establish a connection using the actual TCP/IP address and port numbers of the
Application Server without application “awareness” of a Proxy Server.
- Network Address Translation (NAT) is supported for the client addresses but is NOT
supported for the Application Server addresses.
- Connections are outbound only from the client to the server.
- Security policies, firewall rules, proxy rules and router packet filters must allow outbound
connections (and inbound replies) on destination port 2429 to the Class “B” network address
164.57.0.0. when using the non-WCF application servers. If the client’s account has been
configured to use Windows Communications Foundation or WCF, there are no additional port
requirements. The source port selection uses standard port numbers 1024 and above.

The application installs about 10 ActiveX controls and it wouldn’t run on my desktop. Ben managed to get it to run using the OpenText socks client. It has an option to “socksify everything else” which he says proves to be very useful when you don’t know what specific application to socksify. So now let me repeat what I have just said: Ben got it to work without any changes to the firewall, ignoring all the vendor’s advice and requirements!

I was very pleased as this was getting to be a high-priority issue what with these sales and use taxes due each month.

But Ben didn’t stop there. He came up with even better solution. He said he was looking around at the folder where all the stuff is installed by the application. He noticed a file called ConfigProxy. He configured it to use the system proxy settings. Then he exempted the target site from proxy authentication. Lo and behold that worked as well, with no socksification required at all. We only socksify an app as a last resort.

This latest finding completely contradicts the vendor’s stated network requirements. But it’s better this way.

We now have a happy tax department. Case closed.

Conclusion
Vendor network requirements are not always what they seem. Clearly they are not testing in the more obscure environments such as a private Intranet with an independent namespace that connects to the wider Internet only via explicit proxy. If you’re in this situation, which offers some serious security advantages, there are things you can do to get demanding applications to work.

Categories
Admin IT Operational Excellence Linux Proxy Web Site Technologies

The IT Detective Agency: intermittent web page not found error

Intro
One of the high arts of IT is system integration, and an important off-shoot of this is acquisitions. We are involved in integrating a new location, which, unfortunately, we do not yet have full access to. The local networking is still provided by their vendor, not ours and this makes troubleshooting all the more difficult.

The Details
So the word begins to spread that users at this site are having intermittent problems accessing some of our secure web sites. As it was described to me, they can load the page in their browser for, say five straight times, get a simple Internet Explorer cannot display the web page error, and the sixth time (or whenever) it will load properly. All other connectivity was working. No one else at other locations was having this problem with this web site. More than strange, right?

In drjohn’s perfect IT world, problem reproducibility is critical to resolution, but we simply didn’t have it this time. I also could not produce the problem myself, which means relying on other people.

I’m not sure if we tried to contact their vendor or not at first. But if we had I’m sure they would have denied having anything to do with it.

So we got one of our confederates, Tim, over to this location and we hooked him up with Wireshark so he could get take a packet trace when the failure occurs. It wasn’t long before Tim reproduced the error and emailed us the packet capture.

In the following the PC has IP address 10.200.23.34, the web server is at 10.4.5.6. The Linux command used to look at the capture file is:

# tcpdump -A -r bodega-error.cap port 443 > /tmp/dump

1 15:54:27.495952 IP 10.200.23.34 > 10.4.5.6.https: S 2803722614:2803722614(0) win 64240 <mss 1460,nop,wscale 0,nop,nop,sackOK>
2 15:54:27.496309 IP 10.4.5.6.https > 10.200.23.34: S 3201081612:3201081612(0) ack 2803722615 win 5840 <mss 1432,nop,nop,sackOK>
3 15:54:27.496343 IP 10.200.23.34 > 10.4.5.6.https: . ack 1 win 64240
4 15:54:27.497270 IP 10.200.23.34 > 10.4.5.6.https: P 1:82(81) ack 1 win 64240
5 15:54:27.497552 IP 10.4.5.6.https > 10.200.23.34: . ack 82 win 5840
6 15:54:30.743827 IP 10.4.5.6.https > 10.200.23.34: P 1:286(285) ack 82 win 5840
..S.......^M..i.P.......HTTP/1.0 200 OK^M
Cache-Control: no-store^M
Pragma: no-cache^M
Cache-Control: no-cache^M
X-Bypass-Cache: Application and Content Networking System Software 5.5.17^M
Connection: Close^M
^M
<HTML><HEAD><META HTTP-EQUIV="REFRESH" CONTENT="0;URL=https://10.4.5.6/"></HEAD><BODY>
</BODY></HTML>
 
7 15:54:30.744036 IP 10.200.23.34 > 10.4.5.6.https: F 82:82(0) ack 286 win 63955
8 15:54:30.744052 IP 10.4.5.6.https > 10.200.23.34: F 286:286(0) ack 82 win 5840
9 15:54:30.744077 IP 10.200.23.34 > 10.4.5.6.https: . ack 287 win 63955
10 15:54:30.744289 IP 10.4.5.6.https > 10.200.23.34: . ack 83 win 5840

The output was scrubbed a bit of meaningless junk characters and I added serial packets numbers in the beginning by hand because I don’t (yet) know how to do that with tcpdump!

What, It’s Encrypted – what can you even learn from a trace?
Yeah, an SSL stream sure adds to the already steep challenges we faced in this problem. There just isn’t much to work with. But it is something. I’m about to say what I noticed in this packet trace, but for it to be meaningful you need to know like I did that the web server is situated almost four thousand miles from the user’s location.

The first packet is a SYN from the PC to web server on TCP port 443. So far so good. In fact packets one – three constitute the three-way handshake in TCP.

Although SSL is encrypted, the beginning of the protocol communication should show the SSL cipher being chosen. Unfortunately, tcpdump doesn’t seem to have the smarts to show any of this. So I got myself ssldump. On Ubuntu:

# sudo apt-get install ssldump

did the trick. Then run this same capture file through ssldump, which has very similar arguments to tcpdump:

# ssldump -r bodega-error.cap port 443

New TCP connection #1: 10.200.23.34(2027) <-> 10.4.5.6(443)
1 1  0.0013 (0.0013)  C>S SSLv2 compatible client hello
  Version 3.1
  cipher suites
  TLS_RSA_WITH_RC4_128_MD5
  TLS_RSA_WITH_RC4_128_SHA
  TLS_RSA_WITH_3DES_EDE_CBC_SHA
  SSL2_CK_RC4
  SSL2_CK_3DES
  SSL2_CK_RC2
  TLS_RSA_WITH_DES_CBC_SHA
  SSL2_CK_DES
  TLS_RSA_EXPORT1024_WITH_RC4_56_SHA
  TLS_RSA_EXPORT1024_WITH_DES_CBC_SHA
  TLS_RSA_EXPORT_WITH_RC4_40_MD5
  TLS_RSA_EXPORT_WITH_RC2_CBC_40_MD5
  SSL2_CK_RC4_EXPORT40
  SSL2_CK_RC2_EXPORT40
  TLS_DHE_DSS_WITH_3DES_EDE_CBC_SHA
  TLS_DHE_DSS_WITH_DES_CBC_SHA
  TLS_DHE_DSS_EXPORT1024_WITH_DES_CBC_SHA
  Unknown value 0xff
Unknown SSL content type 72
1    3.2480 (3.2467)  C>S  TCP FIN
1 2  3.2481 (0.0000)  S>CShort record
1    3.2481 (0.0000)  S>C  TCP FIN

The way to interpret this is that 0.0013 s into the TCP port 443 communication the cipher suites listed above were sent by the PC to the server. This corresponds to our packet number 4 in the trace file.

Using Wireshark to look at the trace is a lot more convenient – it provides packet numbers, timing, decodes packets and displays the SSL ciphers. But I wanted to show that it _could_ be done with text-based tools.

Look at the timings more closely. In the tcpdump output, packet 2, the SYN ACK, comes 1 ms after the SYN. But given the distances involved between PC and server, the SYN ACK should have come more like 100 ms later, at least. Similarly packet 5, which is an ACK, comes less than 1 ms after packet 4. A 1 ms ACK? Physically impossible.

I have seen this behaviour before – on our own load balance – which I know employs some TCP optimization tricks. So I concluded that they must have physically present at this site some kind of appliance which is doing TCP optimization. It can only provide blank ACKs in its rapid-fire responses since it can’t know what data the server is really going to respond with. That might all be OK. But I’m pretty sure the problem lies between packets 5 and 6. 5 is one of those meaningless rapid-fire empty ACKs generated by the local router. But the PC has just sent a wish list of SSL ciphers in packet 4. It needs to be responded to by the server which has to finish setting up the SSL session.

But that critical packet from the server never arrives. Perhaps even some of the SSL handshake is secretly completed between the local router and the server. Who knows? I have heard of man-in-the-middle devices that decrypt SSL sessions. And packet 6 contains fairly inappropriate content. It almost does look like it has been manufactured by a man-in-the-middle device. Its telling the browser to do a redirect to the same site, except specified by IP address rather than FQDN. And that doesn’t make a lot of sense. The browser likely realizes that this amounts to a looping redirect request so at that point it probably decides to cut its losses and FIN the connection in packet 7.

I traced my own PC hitting this same web server. Now I know we don’t have any of these optimizing devices between me and the web server. I don’t have time to show the results here, but to summarize, it looks rather completely different from the trace above. The ACK packets come back in about 100 ms or so. There is no delay of three seconds. The cipher proposals are responded to in a timely fashion. There is no redirect.

Their Side of the Story
We did get to hear back from the vendor who supports the LAN/WAN. They said they were running WCCP and diverting traffic to a proxy server. This was the correct behaviour before we hooked our infrastructure to this site, but is no longer. They realized this was probably a bad thing and took corrective action to turn off WCCP for destinations in the internal network 10.0.0.0/8.

Conclusion
Shutting off WCCP, which diverted web site requests to an old proxy server, fixed the problem.

Case closed.

Unsolved Mysteries
I wish we could tie all the loose ends neatly up, but there are too many players involved. We’ll never really know why the problem was intermittent, for instance. Or why some secure web sites could be accessed without any issue whatsoever throughout this ordeal.

WCCP, Web cache Communication Protocol, is a Cisco-developed routing protocol to transparently intercept traffic destined for web servers. More information can be found on it in wikipedia.

It bothers me that after the SSL session was initiated the dump showed the source, unencrypted, of the HTML redirect packet. Why wasn’t that encrypted? Perhaps the WCCP-invoked proxy server was desperately trying to help the PC recover from an unrecoverable situation and manufactured that HTTP-EQUIV REFRESH… to try to force the PC to choose a web site that might work. The fact that it was sent unencrypted over a channel that should have been encrypted was probably even the death bell that triggered the browser to think this makes no sense at all and is even a violation of security, I’m getting out of here.