Categories
TCP/IP Uncategorized Web Site Technologies

The IT Detective Agency: web site not accessible

Intro
In this spellbinding segment we examine what happened when a user found an inaccessible web site.


Some details
The user in a corporate environment reports not being able to access https://login.smartnotice.net/. She has the latest version of Windows 10.


On the trail
I sense something is wrong with SSL because of the type of errors reported by the browser. Something to the effect that it can’t make a secure connection.


But I decided to doggedly pursue it because I have a decent background in understanding SSL-related problems, and I was wondering if this was the first of what might be a systemic problem. I’m always interested to find little problems and resolve them in a way that addresses bigger issues.


So the first thing I try, to learn more about the SSL versions and ciphers supported, is my go-to site, ssllabs.com, Test your Server: https://www.ssllabs.com/ssltest/. Well, this test failed miserably, and in a way I’ve never seen before. SSLlabs just quickly gave up without any analysis! So we pushed ahead, undaunted.


So I hit the site with curl from my CentOS 8 server (Upgrading WordPress brings a thicket of problems). Curl works fine. But I see it prefers to use TLS 1.3. So I finally buckle down and learn how to properly control the SSL/TLS version in curl. The output from curl --help is misleading, shall we say?


You think using curl --tlsv1.2 is going to use TLS v 1.2? Think again. Maybe it will, or maybe it won’t. In fact it tells curl to use TLS version 1.2 or higher. I totally missed understanding that for all these years.
What I’m looking for is to determine if the web site is willing to use TLS v 1.2 in addition to TLS v 1.3.


The ticket is … --tls-max 1.2 . This sets the maximum TLS version curl will use to access the URL.


So we have
curl -v --tls-max 1.3 https://login.smartnotice.net/

*   Trying 104.18.27.134...
* TCP_NODELAY set
* Connected to login.smartnotice.net (104.18.27.134) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
...
html head

But

curl -v --tls-max 1.2 https://login.smartnotice.net/

*   Trying 104.18.27.134...
* TCP_NODELAY set
* Connected to login.smartnotice.net (104.18.27.134) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS alert, protocol version (582):
* error:1409442E:SSL routines:ssl3_read_bytes:tlsv1 alert protocol version
* Closing connection 0
curl: (35) error:1409442E:SSL routines:ssl3_read_bytes:tlsv1 alert protocol version

So now we know, this web site requires the latest and greatest TLS v 1.3.
Even TLS 1.2 won’t do.
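
The same check can be scripted if you would rather not remember the curl flags. What follows is only a sketch under a few assumptions: the helper names capped_context and negotiated_version are made up for illustration, and it leans on the maximum_version setting of Python's standard ssl module, which plays roughly the role of curl's --tls-max.

```python
import socket
import ssl


def capped_context(max_version: ssl.TLSVersion) -> ssl.SSLContext:
    """Return a client context that will not negotiate above max_version."""
    ctx = ssl.create_default_context()
    ctx.maximum_version = max_version  # comparable to curl --tls-max
    return ctx


def negotiated_version(host, max_version, port=443, timeout=5.0):
    """Return the TLS version string agreed with host, or None on failure."""
    ctx = capped_context(max_version)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return tls.version()
    except OSError:
        # ssl.SSLError is a subclass of OSError, so a rejected handshake
        # (such as a protocol version alert) also ends up here.
        return None
```

Against a server that only speaks TLS 1.3, a call such as negotiated_version("login.smartnotice.net", ssl.TLSVersion.TLSv1_2) would be expected to come back as None, while the same call capped at ssl.TLSVersion.TLSv1_3 should report a version string.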

Well, this old corporate environment still offered users a choice of old browsers, including IE 11 and the old Edge browser. These two browsers simply do not support TLS 1.3. But I found even Firefox wasn’t working, although the Chrome browser was.

How to explain all that? How to fix it?

It comes down to a good knowledge of the particular environment. This corporate environment uses proxies, which in turn, most likely, tried to SSL intercept the traffic. The proxies are old so they in turn don’t actually support SSL interception of TLS v 1.3! They had separate problems with the Chrome browser so they weren’t intercepting its traffic. This explains why FF was broken yet Chrome worked.

So the fix, such as it was, was to disable SSL interception for this request
URL so that Firefox would work, and tell the user to use either FF or Chrome.

Just being thorough, when I tested from home with Edge Chromium – the newer Edge browser – it worked and SSLlabs showed (correctly) that it supports TLS 1.3. Edge in the corporate environment is the older, non-Chromium one. It seems to max out at TLS 1.2. No good.

For good measure I explained the situation to the desktop support people.

Case: closed.

Appendix

How did I decide the proxies didn’t support TLS 1.3? What if this site had some other issue after all? I looked on the web for another web site which only supports TLS 1.3. I thought hopefully badssl.com would have one. But they don’t! Undaunted yet again, I determined to change my own web site, drjohnstechtalk.com, into one that only supports TLS 1.3! This is easy to do with the Apache web server. You basically need a line that looks like this:

SSLProtocol all -SSLv3 -TLSv1 -TLSv1.1 -TLSv1.2

Categories
Network Technologies TCP/IP

Quick Tip: Why Windows traceroute works better than Linux

Intro
We noticed when debugging with the always useful tool traceroute (tracert on Windows systems) that we got more responsive results from Windows than from a Linux server on the same or nearby network. Finally I decided to look into it, my Linux pride at stake!

It turns out that the Windows tracert utility uses ICMP by default whereas Linux traceroute uses UDP packets. We had been testing on a corporate Intranet where the default firewall policy was to allow ICMP but deny everything else.

The fix
Just add a -I switch to your Linux traceroute command and the results will be as good as Windows. That switches the packet type to ICMP.

On the Internet, where that type of firewall is less common, it probably won’t make that much of a difference. But on an Intranet it could be just the thing you need.
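
To make the packet-type difference a bit more concrete, here is roughly what an ICMP echo request, the kind of probe tracert and traceroute -I send, looks like on the wire. It is a sketch only: the function names are invented for illustration, and it just builds the bytes, since actually sending them needs a raw socket and usually root.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """Standard 16-bit one's-complement checksum used by ICMP."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold any carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def icmp_echo_request(ident: int, seq: int, payload: bytes = b"probe") -> bytes:
    """Build an ICMP echo request (type 8, code 0) with a valid checksum."""
    unchecked = struct.pack("!BBHHH", 8, 0, 0, ident, seq) + payload
    checksum = internet_checksum(unchecked)
    return struct.pack("!BBHHH", 8, 0, checksum, ident, seq) + payload
```

A firewall that allows ICMP but denies everything else lets this kind of probe through, while the UDP datagrams that Linux traceroute sends by default are dropped.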

Categories
Network Technologies TCP/IP

Spin up your own VPN with OpenVPN

Intro
I recently visited a foreign country where I was unable to watch an Amazon Prime Original show because of my location. Annoyed, I decided then and there to investigate OpenVPN. I am an ideal candidate – I already run my own Linux server in the Amazon cloud, and I know Linux and networking, so I’ve pretty much got all the ingredients already present. The software is free and I will incur no additional cost if I ever do get it working since I already pay for my server which is primarily used as a low-demand web server. Little did I know what I was getting myself into!

The details
I seem to have made every mistake in the book, but I have a strong grasp of the networking involved so I was confident in my general approach. Seeing how configurable the software is I decided to ramp up my efforts incrementally, introducing more and more features until I arrived at the minimum desired (still not there, by the way, but getting close!).

The package on various OSes
Conveniently, openvpn is an installable package on SLES and CentOS. On SLES a zypper install openvpn suffices whereas on CentOS a yum install openvpn does the trick. On Debian-based systems such as Raspbian, an apt-get install openvpn works to install it.

Then look at the examples at the bottom of the man page for openvpn. I picked two servers where I have root and tried to replicate their most basic example. I think it is really important to crawl before you run. To paraphrase it:

Example 1: A simple tunnel without security
       On may:
 
              openvpn --remote june.kg --dev tun1 --ifconfig 10.4.0.1 10.4.0.2 --verb 9
 
       On june:
 
              openvpn --remote may.kg --dev tun1 --ifconfig 10.4.0.2 10.4.0.1 --verb 9
 
       Now verify the tunnel is working by pinging across the tunnel.
 
       On may:
 
              ping 10.4.0.2
 
       On june:
 
              ping 10.4.0.1

june.kg and may.kg are the IP addresses or FQDNs of the june and may server, respectively.

If you installed from a package you probably don’t need these preliminary steps they mention:

              mknod /dev/net/tun c 10 200
 
              modprobe tun

I checked it by a directory listing of /dev/net/tun:

crw-rw-rw- 1 root root 10, 200 Jan  6 08:35 /dev/net/tun

and running modprobe tun for good measure.

And, yes, their basic example worked. How? The command creates a virtual adapter, tun0, specially designed for tunnels, which records the private tunnel IP of the server itself as well as the IP of the tunnel endpoint on the other server.

It’s so cool. What’s not well explained is that you ought to choose completely private IPs for building your tunnels that don’t interfere with your other private IPs. You’ll see I eventually settled on 172.27.28.29 – I’ve never seen that range used anywhere.
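
One way to sanity-check a candidate tunnel address before committing to it is sketched below with Python's standard ipaddress module. The helper name conflicts and the list of networks in use are assumptions for illustration; substitute whatever private ranges your own hosts already route.

```python
import ipaddress


def conflicts(tunnel_ip, networks_in_use):
    """Return the networks from networks_in_use that contain tunnel_ip."""
    addr = ipaddress.ip_address(tunnel_ip)
    return [net for net in networks_in_use
            if addr in ipaddress.ip_network(net)]


# Hypothetical ranges already in use on the two endpoints.
in_use = ["192.168.2.0/24", "10.4.0.0/24"]
clashes = conflicts("172.27.28.29", in_use)
is_private = ipaddress.ip_address("172.27.28.29").is_private
```

An empty clashes list together with is_private being true is what you want to see. If any existing route covers the whole 172.16.0.0/12 block, the same call will flag it.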

You quit running the openvpn command and the tun0 adapter is destroyed. It’s very tidy.

A small word about routing for now
What about routing? How does that work? What you probably didn’t appreciate is that in this simple example, although you can ping each other using the tunnel IP, you really can’t do any more than that unless you start to introduce additional routes. So as is, it’s a long way from where we need to be. More on routing later. Check your route table with a netstat -rn.

Finding the right example is surprisingly hard
For such a popular program you’d think examples of what I’m trying to do – change the effective IP of my laptop – would abound and implementation would be a piece of cake. But alas, nothing could be further from the truth. I’ve yet to see a complete example so I cobbled together things from various places.

You gotta understand that for a lot of people with enough tech-savvy to write up what they did, I guess they were just tickled pink to be able to tunnel through their home router and access their home networks. That’s fine and all, but it’s not all that pertinent to my usage, where the routing considerations are pretty different.

Anywho, this article is really helpful, though not sufficient by itself:

https://openvpn.net/index.php/open-source/documentation/miscellaneous/78-static-key-mini-howto.html

That describes a single server/client setup with simple security.

Server config
After a lot of debugging and incremental steps, I’m currently using this file with filepath /etc/openvpn/server.conf on my Amazon AWS server:

# -DrJ 1/5/16
# simple one server/one client setup...
# https://openvpn.net/index.php/open-source/documentation/miscellaneous/78-static-key-mini-howto.html
# static.key generated with openvpn --genkey --secret static.key
# iptables NAT discussion: http://www.karlrupp.net/en/computer/nat_tutorial
# using (simple udp test worked): iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
# list: iptables -t nat -L -v
# dig on RasPi: from dnsutils
dev tun
# 1194 is default, but...
port 2096
ifconfig 172.27.28.29 172.27.28.30
secret static.key
# use compression
comp-lzo
# resist failures
keepalive 10 60
ping-timer-rem
persist-tun
persist-key
# run as daemon
user nobody
group nobody
daemon
proto tcp-server

An experimental client config file currently looks like this:

# for understanding what openvpn can do...
# simple single server/client setup from https://openvpn.net/index.php/open-source/documentation/miscellaneous/78-static-key-mini-howto.html
# - DrJ 1/6/16
remote drjohnstechtalk.com
# 1194 is default but I'll use this one...
port 2096
dev tun
ifconfig 172.27.28.30 172.27.28.29
secret static.key
# resist failures
keepalive 10 60
ping-timer-rem
persist-tun
persist-key
# use compression
comp-lzo
# for testing
proto tcp-client
# a test host route
#route 172.16.0.23
# a test network route which includes endpoint
route 172.16.0.0 255.240.0.0
# stands for broad, Internet route which may overlap our route to the openvpn server: doesn't work by itself!
route drjohnstechtalk.com 255.255.255.255 net_gateway
route 50.17.188.0 255.255.255.0

Works through proxy
Amazingly, I can confirm openvpn works through a standard http proxy. This is both cool and a little scary. To the above config file I added something like this:

# use proxy which requires basic authentication
http-proxy proxy-1.johnstechtalk.com 8080 authfile

where authfile is in the same directory as client.conf (/etc/openvpn) and has the proxy username and password on two separate lines.

You have to run your server in TCP mode in order to access it from a client that’s behind an http proxy, however, hence the proto tcp-server line in the server config and the proto tcp-client line in the client config. I’ll experiment with that to see if that’s costing performance.
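
Why TCP? An http proxy carries non-web traffic by way of the CONNECT method, and what it hands back is a plain TCP byte stream, so there is nothing for a UDP tunnel to ride on. The sketch below shows what a generic CONNECT request with basic authentication looks like. It is not a dump of OpenVPN's exact bytes, and the function name and the user/pass credentials are placeholders.

```python
import base64


def connect_request(target_host, target_port, username, password):
    """Build a generic HTTP CONNECT request with Basic proxy authentication."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode("ascii")
    lines = [
        f"CONNECT {target_host}:{target_port} HTTP/1.1",
        f"Host: {target_host}:{target_port}",
        f"Proxy-Authorization: Basic {token}",
        "",
        "",  # a blank line terminates the request headers
    ]
    return "\r\n".join(lines).encode("ascii")


request = connect_request("drjohnstechtalk.com", 2096, "user", "pass")
```

Once the proxy answers with a 200 status, everything after that on the same connection is passed through untouched, which is what lets the VPN traffic flow.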

A little more on routing
Why the circumspect routes? I’m deathly afraid of locking myself out of these servers by implementing the wrong routes! And I learned a lot bootstrapping my way to broader routes. The thing is, I noticed my AWS server uses this private IP address for its DNS server: 172.16.0.23 (cat /etc/resolv.conf, for example). So I realized that I could introduce a host route to that IP on the openvpn client to begin the arduous process of building up and testing real routing. How to test? Simple. With commands like

> dig ns johnstechtalk.com @172.16.0.23

on the client.
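
When moving from the host route to the broader network route it helps to double-check what a netmask like 255.240.0.0 actually covers. Here is a small sketch using Python's standard ipaddress module; the variable names are mine, the addresses are the ones from the configs above.

```python
import ipaddress

# The network route from the client config: route 172.16.0.0 255.240.0.0
route = ipaddress.ip_network("172.16.0.0/255.240.0.0")

dns_server = ipaddress.ip_address("172.16.0.23")       # AWS resolver
tunnel_endpoint = ipaddress.ip_address("172.27.28.29")  # server side of tunnel

covers_dns = dns_server in route
covers_endpoint = tunnel_endpoint in route
```

The route works out to 172.16.0.0/12, which covers both the DNS server and the tunnel endpoint. That matches the comment in the client config about the network route including the endpoint.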

Dig is a useful networking tool. I describe installing it on Windows systems in this post.

Point of concern over performance
Even before expanding the routes I am concerned about TCP performance. I noticed that when I add the +tcp option to dig to force the query to use TCP I get a big performance penalty. A regular query that takes 60 msec to the AWS DNS server from my PC takes anywhere from 200 – 400 msec over TCP! Now there are a bunch more packets going back and forth, but even allowing for that there is still a disconcerting performance hit of 100 msec or so. By contrast the enterprise-class VPN provided by Juniper suffers from no such TCP performance penalty – I know because I tested dig over a Juniper VPN using the JunOS Pulse client.
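
If you want to repeat this comparison without dig, the measurement can be approximated in a few lines of Python. Treat it as a rough sketch: the function names are invented, the query is a bare-bones NS lookup built by hand, and the TCP timing stops at the first bytes of the reply rather than parsing the whole answer.

```python
import socket
import struct
import time


def build_query(name, qtype=2, query_id=0x1234):
    """Build a minimal DNS query (qtype 2 = NS, class IN, recursion desired)."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def time_udp(server, query, timeout=3.0):
    """Milliseconds until a reply to a single UDP query arrives."""
    start = time.perf_counter()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(query, (server, 53))
        sock.recvfrom(4096)
    return (time.perf_counter() - start) * 1000


def time_tcp(server, query, timeout=3.0):
    """Milliseconds for TCP connect plus query until the reply starts."""
    start = time.perf_counter()
    with socket.create_connection((server, 53), timeout=timeout) as sock:
        # DNS over TCP prefixes each message with a two-byte length.
        sock.sendall(struct.pack("!H", len(query)) + query)
        sock.recv(4096)
    return (time.perf_counter() - start) * 1000
```

Calling time_udp("172.16.0.23", build_query("johnstechtalk.com")) and the time_tcp equivalent from the client gives the two numbers to compare. The TCP figure includes the three-way handshake, so at least one extra round trip is to be expected.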

Up our game, try to run as a full server w/ DHCP and everything
I’ll show the config file below. Let me just mention that my first attempt didn’t work because I continued to use a shared secret (the secret declaration), but that is incompatible with a new statement I introduced on the server:

server 172.27.28.32 255.255.255.224

So I gotta bite the bullet and get my PKI infrastructure up and running. In the old days they bundled easy-rsa with openvpn. Now you have to install it separately. I was able to do a yum install easy-rsa on my CentOS instance.

Hey, easy-rsa is pretty cool – I might be able to use that for other things. I always wanted to know how to create my own CA! The instructions in the Howto are so complete that there’s really no need to go over that here. HowTo link.

I copied my /usr/share/easy-rsa/2.0/* files to /etc/openvpn/rsa and did all the PKI-building work there. But I don’t want to go crazy with encryption so I downgraded Diffie-Hellman from 2048 to 1024 bits:

$ openssl dhparam -out dh1024.pem 1024

Full Monty – doing a full VPN
Enough warm-up. I tried full VPN and for some reason hit the right configuration on the first try.

Here’s my server.conf file on my Amazon AWS server.

# -DrJ 1/13/16
# multi-client server from https://openvpn.net/index.php/open-source/documentation/howto.html#examples
# static.key generated with openvpn --genkey --secret static.key
# iptables NAT discussion: http://www.karlrupp.net/en/computer/nat_tutorial
# using (simple udp test worked): iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
# list: iptables -t nat -L -v
port 1194
proto udp
dev tun
ca rsa/keys/ca.crt
cert rsa/keys/server.crt
key rsa/keys/server.key  # This file should be kept secret
dh rsa/keys/dh1024.pem
# pool of addresses for clients and server. server gets 172.27.28.33
server 172.27.28.32 255.255.255.224
ifconfig-pool-persist ipp.txt
# very experimental
push "redirect-gateway def1 bypass-dhcp"
# just use closest Google DNS server
push "dhcp-option DNS 8.8.8.8"
# resist failures
keepalive 10 60
ping-timer-rem
# use compression
comp-lzo
# only allow two clients max for now
max-clients 2
user nobody
group nobody
persist-key
persist-tun
status openvpn-status.log
# verbosity level. 0 - all but fatal errors, 9 - extremely verbose
verb 3

And here’s my Windows 10 client file.

# from https://openvpn.net/index.php/open-source/documentation/howto.html#examples
# - DrJ 1/17/16
client
dev tun
proto udp
remote drjohnstechtalk.com 1194
nobind
# resist failures
keepalive 10 60
ping-timer-rem
persist-tun
persist-key
# use compression
comp-lzo
# SSL/TLS params
ca ca.crt
cert client1.crt
key client1.key
# stands for broad, Internet route which may overlap our route to the openvpn server: doesn't work by itself!
route drjohnstechtalk.com 255.255.255.255 net_gateway
verb 3

I was a bit concerned the DHCP stuff might not work, but it did, including assigning a Google DNS server at 8.8.8.8.
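
To see what address pool that server 172.27.28.32 255.255.255.224 line actually creates, the arithmetic can be sketched with Python's ipaddress module. I am assuming here that addresses are handed out in four-address /30 blocks (the older net30 style), which is consistent with the 255.255.255.252 mask the Windows client reports. The helper name net30_blocks is made up.

```python
import ipaddress

# From server.conf: server 172.27.28.32 255.255.255.224
pool = ipaddress.ip_network("172.27.28.32/255.255.255.224")


def net30_blocks(network):
    """Split the pool into /30 blocks and list the two usable hosts in each."""
    return [(str(block), [str(host) for host in block.hosts()])
            for block in network.subnets(new_prefix=30)]


blocks = net30_blocks(pool)
```

The pool is a /27 and splits into eight such blocks. The first one holds 172.27.28.33, the address the config comment says the server takes, and the second holds 172.27.28.37 and 172.27.28.38.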

I generated the client1.crt and client1.key using easy-rsa. These plus ca.crt I copied down to my laptop. For security I then deleted client1.key on my server (so no one else can grab it).

Performance
Performance is amazing. There no longer is the penalty for doing dig over tcp. speedtest.net actually shows better results when I am connected over my VPN! This is a totally unexpected surprise. I guess that’s due to the compression, using UDP and not using overly strong encryption.

Speedtest results
speedtest.net results without VPN, i.e., regular mode: 8.0 Mbps download, 1.0 Mbps upload. With VPN I got 11.0 Mbps download and 1.2 Mbps upload.

A platform that didn’t work
You’ll see from my other blogs that I am a Raspberry Pi fan. So naturally I wanted to bring up a VPN client on the Raspberry Pi. I am stuck on this point however because unlike the SLES Linux I tested with, Raspbian brings up a tun interface but then drops the wlan0 IP address so all communication to it is lost. Maybe it would work better wired – I’ll try it and post the results.

Still more about routing
I never understood how the openvpn examples were supposed to work until I had to implement them. Indeed they generally wouldn’t work until you address some routing and NAT issues. There is no magic. I did traces to show my client was trying to communicate with its assigned private IP of 172.27.28.29. Well that’s never going to work. You have to make sure routing is enabled on your server:

On Linux, use the command:
$ echo 1 > /proc/sys/net/ipv4/ip_forward
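
To check whether forwarding is already switched on, the same file can simply be read back. A minimal sketch, with a made-up function name and the path exposed as a parameter so it can be pointed at a test file:

```python
from pathlib import Path


def forwarding_enabled(path="/proc/sys/net/ipv4/ip_forward"):
    """Return True if the kernel's IPv4 forwarding switch reads 1."""
    try:
        return Path(path).read_text().strip() == "1"
    except OSError:
        # File missing or unreadable, e.g. not running on Linux.
        return False
```

Note that the echo command only changes the running kernel. The setting does not survive a reboot unless it is also made persistent, for example in the sysctl configuration.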

But that’s not sufficient either. That doesn’t address that the packets from your client have the wrong IP address as far as the outside world is concerned. So you have to NAT (address translate) those packets. I don’t like iptables but this turned out to be easier than expected. As mentioned in my server config file you run:

$ iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

This is the boiled down summary from this helpful article.

Don’t forget that to look at your NAT rules you need a command like this:

$ iptables -t nat -L -v

not simply an iptables -L.

Note that this is a hide NAT – I am not making the openvpn client visible as a server on the Internet. That’s a lot harder with IP addresses being in scarce supply.

Does it work to solve my original problem? Well I’m not in a foreign country to run that test, but I can vouch that I can run Amazon Prime through openvpn. So I expect it will work overseas as well.

It works so well how do I know I’m really running through openvpn?

Many ways. For instance my routing table. From a CMD window:

C:> netstat -rn

IPv4 Route Table
===========================================================================
Active Routes:
Network Destination        Netmask          Gateway       Interface  Metric
          0.0.0.0          0.0.0.0      192.168.2.1      192.168.2.8     25
          0.0.0.0        128.0.0.0     172.27.28.37     172.27.28.38     20
    50.17.188.196  255.255.255.255      192.168.2.1      192.168.2.8     25
        127.0.0.0        255.0.0.0         On-link         127.0.0.1    306
        127.0.0.1  255.255.255.255         On-link         127.0.0.1    306
  127.255.255.255  255.255.255.255         On-link         127.0.0.1    306
        128.0.0.0        128.0.0.0     172.27.28.37     172.27.28.38     20
      169.254.0.0      255.255.0.0         On-link       192.168.2.8    306
  169.254.255.255  255.255.255.255         On-link       192.168.2.8    281
     172.27.28.33  255.255.255.255     172.27.28.37     172.27.28.38     20
     172.27.28.36  255.255.255.252         On-link      172.27.28.38    276
     172.27.28.38  255.255.255.255         On-link      172.27.28.38    276
     172.27.28.39  255.255.255.255         On-link      172.27.28.38    276
      192.168.2.0    255.255.255.0         On-link       192.168.2.8    281
      192.168.2.8  255.255.255.255         On-link       192.168.2.8    281
    192.168.2.255  255.255.255.255         On-link       192.168.2.8    281
        224.0.0.0        240.0.0.0         On-link         127.0.0.1    306
        224.0.0.0        240.0.0.0         On-link       192.168.2.8    281
        224.0.0.0        240.0.0.0         On-link      172.27.28.38    276
  255.255.255.255  255.255.255.255         On-link         127.0.0.1    306
  255.255.255.255  255.255.255.255         On-link       192.168.2.8    281
  255.255.255.255  255.255.255.255         On-link      172.27.28.38    276

Specifically note the two routes to 0.0.0.0 netmask 128.0.0.0 and 128.0.0.0 netmask 128.0.0.0 via gateway 172.27.28.37. That’s just what this experimental command in my server config file was supposed to do:

push "redirect-gateway def1 bypass-dhcp"

namely, create two broad routes to the entire Internet, just slightly more specific than a default route so they take precedence over the default route – I think it’s a clever idea. And of course ipconfig shows my private IP address:

Ethernet adapter Ethernet 2:
 
   Connection-specific DNS Suffix  . :
   Link-local IPv6 Address . . . . . : fe80::8871:c910:185e:953b%5
   IPv4 Address. . . . . . . . . . . : 172.27.28.38
   Subnet Mask . . . . . . . . . . . : 255.255.255.252
   Default Gateway . . . . . . . . . :
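
The reason those two half-Internet routes win over the default route is longest-prefix matching: the most specific route that contains the destination is the one used. Here is a small sketch of that selection using a trimmed-down copy of the routing table above. The function name next_hop is invented, and a real routing stack also weighs metrics when prefixes tie, which this ignores.

```python
import ipaddress

# Trimmed-down version of the Windows routing table shown above.
ROUTES = [
    ("0.0.0.0/0", "192.168.2.1"),          # original default route
    ("0.0.0.0/1", "172.27.28.37"),         # pushed by redirect-gateway def1
    ("128.0.0.0/1", "172.27.28.37"),       # pushed by redirect-gateway def1
    ("50.17.188.196/32", "192.168.2.1"),   # host route to the VPN server
]


def next_hop(destination, routes=ROUTES):
    """Pick the gateway of the most specific route containing destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(ipaddress.ip_network(net), gateway)
               for net, gateway in routes
               if addr in ipaddress.ip_network(net)]
    return max(matches, key=lambda match: match[0].prefixlen)[1]


halves = [ipaddress.ip_network("0.0.0.0/1"), ipaddress.ip_network("128.0.0.0/1")]
combined = list(ipaddress.collapse_addresses(halves))
```

Any ordinary Internet address resolves to the tunnel gateway 172.27.28.37, while the /32 host route keeps traffic to the VPN server itself on the physical gateway so the tunnel does not try to carry its own packets. The two /1 halves collapse to 0.0.0.0/0, so nothing is left uncovered.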

References and related
No patience to roll your own? Five top commercial VPN offerings are described here: http://www.itworld.com/article/3152904/security/top-5-vpn-services-for-personal-privacy-and-security.html
A lightweight dig installation method for Windows is described here.
My article about iptables.

Categories
Admin Network Technologies Proxy Security TCP/IP Web Site Technologies

The IT Detective Agency: Cisco Jabber stopped working for some using WAN connections

Intro
This is probably the hardest case I’ve ever encountered. It’s so complicated many people needed to get involved to contribute to the solution.

Initial symptoms

It’s not easy to describe the problem while providing appropriate obfuscation. Over the course of a few days it came to light that in this particular large company for which I consult many people in office locations connected via an MPLS network were no longer able to log in to Cisco Jabber. That’s Cisco’s offering for Instant Messaging. When it works and is used in combination with Cisco IP phones it’s pretty good – has some nice features. This major problem was first reported November 17th.

Knee-jerk reactions
Networking problem? No. Network guys say their networks are running fine. They may be a tad overloaded but they are planning to route Internet over the secondary links so all will be good in a few days.
Proxy problem? Nope. Proxy guys say their Bluecoat appliances are running fine and besides everyone else is working.
Application problem? Application owner doesn’t see anything out of the ordinary.
Desktop problem? Maybe but it’s unclear.

Methodology
So of the 50+ users affected I recognized two power users that I knew personally and focussed on them. Over the course of days I learned:
– problem only occurs for WAN (MPLS) users
– problem only occurs when using one particular proxy
– if a user tries to connect often enough, they may eventually get in
– users can get in if they use their VPN client
– users at HQ were not affected

The application owner helpfully pointed out the URL for the web-based version of Cisco Jabber: https://loginp.webexconnect.com/… Anyone with the problem also could not log in to this site.

So working with these power users who patiently put up with many test suggestions we learned:

– setting the PC’s MTU to a small value, say 512 up to 696 made it work. Higher than that it generally failed.
– yet pings of up to 1500 bytes went through OK.
– the trace from one guy’s PC showed all his packets re-transmitted. We still don’t understand that.
– It’s a mess of communications to try to understand these modern, encrypted applications
– even the simplest trace contained over 1000 lines which is tough when you don’t know what you’re looking for!
– the helpful networking guy from the telecom company – let’s call him “Regal” – worked with us but all the while declaring how it’s impossible that it’s a networking issue
– proxy logs didn’t show any particular problem, but then again they cannot look into SSL communication since it is encrypted
– disabling Kaspersky helped some people but not others
– a PC with the problem had no problem when put onto the Internet directly
– if one proxy associated with the problem forwarded the requests to another, then it begins to work
– Is the problem reproducible? Yes, about 99% of the time.
– Do other web sites work from this PC? Yes.

From previous posts you will know that at some point I will treat every problem as a potential networking problem and insist on a trace.

Biases going in
So my philosophy of problem solving, which had stood the test of time, is that either it’s a networking problem or it’s a problem on the PC. Best is if there’s a competition of ideas in debugging so that the PC/application people seek to prove beyond a doubt it is a networking problem and the networking people likewise try to prove the problem occurs on the PC. Only later did I realize the bias in this approach and that a third possibility existed.

So I enthused: what we need is a non-company PC – preferably on the same hardware – at the same IP address to see if there’s a problem. Well we couldn’t quite produce that but one power user suggested using a VM. He just happened to have a VM environment on his PC and could spin up a Windows 7 Professional generic image! So we do that – it shows the problem. But at least the trace from it is a lot cleaner without all the overhead of the company packages’ communication.

The hard work
So we do the heavy lifting and take a trace on both his VM with the problem and the proxy server and sit down to compare the two. My hope was to find a dropped packet, blame the network and let those guys figure it out. And I found it. After the client hello (this is a part of the initial SSL protocol) the server responds with its server hello. That packet – a largeish packet of 1414 bytes – was not coming through to the client! It gets re-transmitted multiple times and none of the re-transmits gets through to the PC. Instead the PC receives a packet the proxy never sent it which indicates a fatal SSL error has occurred.

So I tell Regal: look, there’s a problem with these packets. Meanwhile Regal has just gotten a new PC and doesn’t even have Wireshark. Can you imagine such a world? It seems all he really has is his tongue and the ability to read a few emails. And he’s not convinced! He reasons after all that the network has no intelligent, application-level devices and certainly wouldn’t single out Jabber communication to be dropped while keeping everything else. I am no desktop expert so I admit that maybe some application on the PC could have done this to the packets, in effect admitting that packets could be intercepted and altered by the PC even before being recorded by Wireshark. After all I repeated this mantra many times throughout:

This explanation xyz is unlikely, and in fact any explanation we can conceive of is unlikely, yet one of them will prove to be correct in the end.

Meanwhile the problem wasn’t going away so I kludged their proxy PAC file to send everyone using jabber to the one proxy where it worked for all.

So what we really needed was to create a span port on the switch where the PC was plugged in and connect a 2nd PC to a port enabled in promiscuous mode with that mirrored traffic. That’s quite a lot of setup and we were almost there when our power user began to work so we couldn’t reproduce the problem. That was about Dec 1st. Then our 2nd power user also fell through and could no longer reproduce the problem either a day later.

10,000 foot view
What we had so far is a whole bunch of contradictory evidence. Network? Desktop? We still could not say due to the contradictions, the likes of which I’ve never witnessed.

Affiliates affected and find the problem
Meanwhile an affiliate began to see the problem and independently examined it. They made much faster progress than we did. Within a day they found the reason (suggested by their networking person from the telecom, who apparently is much better than ours): the server hello packet has the expedited forwarding (EF) flag set in the differentiated services code point (DSCP) section of the IP header.
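
To make that header field concrete, here is a minimal C++ sketch of how a DSCP value is read out of an IPv4 header. DSCP occupies the upper six bits of the second header byte (the old TOS byte) and EF is code point 46, so a trace shows that byte as 0xb8 when EF is set. The function and constant names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Two well-known DSCP code points.
const std::uint8_t kDscpDefault = 0;   // best effort, what ordinary web traffic carries
const std::uint8_t kDscpEF      = 46;  // Expedited Forwarding, normally reserved for voice

// The second byte of an IPv4 header holds DSCP in its upper six bits
// and ECN in its lower two bits.
inline std::uint8_t dscpFromTos(std::uint8_t tos) {
  return static_cast<std::uint8_t>(tos >> 2);
}

inline std::uint8_t ecnFromTos(std::uint8_t tos) {
  return static_cast<std::uint8_t>(tos & 0x03);
}

// Reads the DSCP value from a raw IPv4 header (needs at least two bytes).
inline std::uint8_t dscpFromIpv4Header(const std::uint8_t* header) {
  assert(header != nullptr);
  return dscpFromTos(header[1]);
}
```

A header whose second byte is 0xb8 decodes to DSCP 46, which is the EF marking described above.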

Say what?
So I really got schooled on this one. I was saying it has to be an application-aware “something” on the network or PC that is purposefully messing around with the SSL communication. That’s what the evidence screamed to me. So a PC-based firewall seemed a strong contender and that is how Regal was thinking.

So the affiliate explained it this way: the company uses QOS on their routers. Phone (VOIP) gets priority and is the only application where the EF bit is expected to be set. VOIP packets are small, by the way. Regular applications like web sites should just use the default QOS. And according to Wikipedia, many organizations that use QOS impose thresholds on EF packets, such that if the traffic exceeds, say, 30% of link capacity, all packets with EF set that are over a certain size get dropped. OK, maybe it doesn’t say that, but that is what I’ve come to understand happens. Which makes the dropping of these particular packets the correct behaviour as per the company’s own WAN contract and design. Imagine that!
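
Here is a toy C++ model of that rule as described in the paragraph above. It is not how any particular router implements QOS. The 30% threshold comes from the text, while the size cutoff and all the names are hypothetical values picked for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical policy knobs. Only the 30% figure appears in the text.
struct EfPolicy {
  double utilizationThreshold;   // fraction of link capacity, e.g. 0.30
  std::size_t maxEfPacketBytes;  // assumed cutoff for "too big to be voice"
};

// Returns true if this toy policy would drop the packet: it must be
// marked EF (DSCP 46), the link must be busier than the threshold,
// and the packet must be larger than the cutoff.
inline bool wouldDrop(const EfPolicy& policy, std::uint8_t dscp,
                      std::size_t packetBytes, double linkUtilization) {
  assert(linkUtilization >= 0.0);
  const bool isEf = (dscp == 46);
  return isEf &&
         linkUtilization > policy.utilizationThreshold &&
         packetBytes > policy.maxEfPacketBytes;
}
```

Under these assumptions a 1414-byte EF packet on a busy link is dropped, while a small EF packet, the same packet on a quiet link, or the same packet with DSCP 0 all get through.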

Smoking gun no more
So now my smoking gun – blame it on the network for dropped packets – is turned on its head. Cisco has set this EF bit on its server hello response on the loginp.webexconnect.com web site. This is undesirable behaviour. It’s not a phone call, after all, which requires minimal jitter in packet timing.

So next time I did a trace I found that instead of EF flag being set, the AF (Assured Forwarding) flag was set. I suppose that will make handling more forgiving inside the company’s network, but I was told that even that was too much. Only default value of 0 should be set for the DSCP value. This is an open issue in Cisco’s hands now.

But at least this explains most observations:
– Small MTU worked? Yup, those packets are looked upon more favorably by the routers.
– One proxy worked, the other did not? Yup, they are in different data centers which have different bandwidth utilization. The one where it was not working has higher utilization.
– Only affected users are at WAN sites? Yup, probably only the WAN routers are enforcing QOS.
– Worked over VPN, even on a PC showing the problem? Yup – all VPN users use a LAN connection for their proxy settings.
– Fabricated SSL fatal error packet? I’m still not sure about that one – guess the router sent it as a courtesy after it decided to drop the server hello – just a guess.
– Problem fixed by shutting down Kaspersky? Nope, guess that was a red herring. Every problem has dead ends and red herrings, just a fact of life. And anyway that behaviour was not very consistent.
– Problem started November 17th? Yup, the affiliate just happened to have a baseline packet trace from November 2nd which showed that DSCP was not in use at that time. So Cisco definitely changed the behaviour of Cisco Jabber sometime in the intervening weeks.
– Other web sites worked, except this one? Yup, other web sites do not use the DSCP section of the IP header so it has the default value of 0.

Conclusion
Cisco has decided to remove the DSCP flag from these packets, which will fix everything. Perhaps EF was introduced in support of Cisco Jabber’s extended use as a soft phone??? Then this company may have some re-design of their QOS to take care of because I don’t see an easy solution. Dropping the MTU on the proxy to 512 seems pretty drastic and inefficient, though it would be possible. My reading of TCP/IP is that nothing prevents a QOS marking from being set on any sort of packet, even though there may be a gentleman’s agreement not to do so ordinarily except for VOIP packets or a few other special classes. I don’t know. I’ve really never looked at QOS before this problem came along.

The company is wisely looking for a way to reset DSCP to 0 on all packets on the Intranet, except of course those like VOIP where it is explicitly supposed to be used. This will be done on the Internet router. In Cisco IOS it is possible with a policy map and a police setting where you can specify set-dscp-transmit default. Apparently VPN and other things that may check the integrity of packets won’t mind the DSCP value being altered – it can happen anywhere along the route of the packet.
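
For the curious, here is a rough C++ sketch of what resetting the DSCP amounts to at the byte level. It illustrates the idea only and is not what the router actually runs. The DSCP bits live in the IPv4 header, so the header checksum has to be recomputed after the change. The TCP checksum does not cover that byte, which fits the observation that the change does not upset things further along. Function names are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Standard Internet checksum over an IPv4 header: one's-complement sum of
// 16-bit big-endian words, then inverted. headerLen is a multiple of 4.
inline std::uint16_t ipv4HeaderChecksum(const std::uint8_t* header,
                                        std::size_t headerLen) {
  std::uint32_t sum = 0;
  for (std::size_t i = 0; i + 1 < headerLen; i += 2) {
    sum += (static_cast<std::uint32_t>(header[i]) << 8) | header[i + 1];
  }
  while (sum >> 16) {
    sum = (sum & 0xFFFF) + (sum >> 16);  // fold carries back in
  }
  return static_cast<std::uint16_t>(~sum & 0xFFFF);
}

// Sets DSCP to 0 (keeping the two ECN bits) and fixes up the checksum.
inline void resetDscp(std::uint8_t* header) {
  assert(header != nullptr);
  const std::size_t headerLen = (header[0] & 0x0F) * 4u;  // IHL field, in bytes
  header[1] &= 0x03;   // clear the six DSCP bits
  header[10] = 0;      // checksum field must be zero while summing
  header[11] = 0;
  const std::uint16_t checksum = ipv4HeaderChecksum(header, headerLen);
  header[10] = static_cast<std::uint8_t>(checksum >> 8);
  header[11] = static_cast<std::uint8_t>(checksum & 0xFF);
}
```

A quick sanity check is that running ipv4HeaderChecksum over the finished header, checksum field included, comes out as 0.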

Boy, applications these days are complicated! And those rare times when they go wrong really require a bunch of cooperating experts to figure things out. No one person holds all the expertise any longer.

My simplistic paradigm of it’s either the PC or the network had to make room for a new reality: it’s the web site in the cloud that did them in.

Could other web sites be similarly affected? Yes it certainly seems a possibility. So I now know to check for use of DSCP if a particular web site is not working, but all others are.

References and related
This Wikipedia article is a good description of DSCP: https://en.wikipedia.org/wiki/Differentiated_services

Categories
Linux Network Technologies SLES TCP/IP

Ethernet Bridging on the cheap. Fail. Then Success with OLTV

Intro
Some experiments just don’t work out. I became curious about a technology that has various names: ethernet bridging, wide-area VLANs, OTV, L2TP, etc. It looked like it could be done on the cheap, but that didn’t pan out for me. But later on we got hold of high-end gear that implements OTV and began to get it to work.

The details
What this is is the ability to extend a subnet to a remote location. How cool is that? This can be very useful for various reasons. A disaster recovery center, for instance, which uses the same IP addressing. A strategic decision to move some, but not all equipment on a particular LAN to another location, or just for the fun of it.

As with anything truly useful there is an open source implementation(s). I found openvpn, but decided against it because it had an overall client/server description and so didn’t seem quite what I had in mind. Openvpn does have a page about creating an ethernet bridging setup which is quite helpful, but when you install the product it is all about the client/server paradigm, which is really not what I had in mind for my application.

Then I learned about Astaro RED at the Amazon Cloud conference I attended. That’s RED as in Remote Ethernet Device. That sounded pretty good, but it didn’t seem quite what we were after. It must have looked good to Sophos as well because as I was studying it, Sophos bought them! Astaro RED is more for extending an ethernet to remote branch offices.

More promising for cheapo experimentation, or so I thought at the time, is etherip.

Very long story short, I never got that to work out in my environment, which was SLES VM servers.

What seems to be the most promising solution, and the most expensive, is overlay transport virtualization (OLTV or simply OTV), offered by Cisco in their Nexus switches. I’ll amend this post when I get a chance to see if it worked or not!

December Update
OTV is beginning to work. It’s really cool seeing it for the first time. For instance, I have a server in South Carolina on an OTV subnet, IP 10.94.45.2. Its default gateway is in New Jersey! Its gateway is in the ARP table, as it has to be, but merely to PING the gateway produces this unusual time lag:

> ping 10.94.45.1

PING 10.94.45.1 (10.94.45.1) 56(84) bytes of data.
64 bytes from 10.94.45.1: icmp_seq=1 ttl=255 time=29.0 ms
64 bytes from 10.94.45.1: icmp_seq=2 ttl=255 time=29.1 ms
64 bytes from 10.94.45.1: icmp_seq=3 ttl=255 time=29.6 ms
64 bytes from 10.94.45.1: icmp_seq=4 ttl=255 time=29.1 ms
64 bytes from 10.94.45.1: icmp_seq=5 ttl=255 time=29.4 ms

See those response times? Huge. I ping the same gateway from a different LAN but same server room in New Jersey and get this more typical result:

# ping 10.94.45.1

Type escape sequence to abort.
Sending 5, 64-byte ICMP Echos to 10.94.45.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/0/1 ms
Number of duplicate packets received = 0

But we quickly stumbled upon a gotcha. Large packets were killing us. The thing is that it’s one thing to run OTV over dark fiber, which we know another customer is doing without issues; but to run it in an MPLS network is something else.

Before making any adjustment on our servers we found behaviour like the following:
– initial ssh to linux server works OK; but session soon freezes after a directory listing or executing other commands
– pings with the -s parameter set to anything greater than 1430 bytes failed – they didn’t get returned

So this issue is very closely related to a problem we observed on a regular segment where getvpn had just been implemented. That problem, which manifested itself as occasional IE errors, is described in some detail here.

Currently we don’t see our carrier being able to accommodate larger packets so we began to see what we could alter on our servers. On Checkpoint IPSO you can lower the MTU as follows:

> dbset interface:eth1c0:ipmtu 1430

The change happens immediately. But that’s not a good idea and we eventually abandoned that approach.

On SLES Linux I did it like this:

> ifconfig eth1 mtu 1430

On this platform, too, the change takes place right away.

By the way, we experimented and found that the largest MTU value we could use was 1430. At this point I’m not sure how to make this change permanent, but a little research should show how to do it.

After changing this setting, our ssh sessions worked great, though now we can’t send pings larger than 1402 bytes.
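
The 1402 figure is just the MTU minus headers. The -s value of ping counts only the ICMP payload, and the IPv4 header (20 bytes, assuming no options) and the ICMP header (8 bytes) come on top of that. A small C++ sketch of the arithmetic, with names invented for this example:

```cpp
#include <cassert>

// Header sizes assumed here: IPv4 without options, plus the ICMP echo header.
const int kIpv4HeaderBytes = 20;
const int kIcmpHeaderBytes = 8;

// Largest ping -s value that still fits in one unfragmented packet
// for the given interface MTU.
inline int maxPingPayload(int mtu) {
  assert(mtu > kIpv4HeaderBytes + kIcmpHeaderBytes);
  return mtu - kIpv4HeaderBytes - kIcmpHeaderBytes;
}
```

With an MTU of 1430 this gives 1402, and with the usual ethernet MTU of 1500 it gives 1472.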

The latest problem is that on our OTV segment we can ping only one device but not the other.

August 2013 update
Well, we are resourceful people so yes we got it running. Once the dust settled OTV worked pretty well, with certain concessions. We had to be able to control the MTU on at least one side of the connection, which, fortunately, we always could. Load balancers, proxy servers, Linux servers, we ended up jiggering all of them to lower their MTU to 1420. For firewall management we ended up lowering the MTU on the centralized management station.

Firewalls needed further voodoo. After pushing policy, clamping needs to be turned back on and acceleration turned off, like this (for Checkpoint firewalls):

$ fw ctl set int fw_clamp_tcp_mss 1
$ fwaccel off
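
The general idea behind MSS clamping, as I understand it, is that the firewall lowers the MSS value advertised in TCP SYN packets so that neither side sends segments too big for the path. The C++ sketch below shows only the arithmetic, assuming plain 20-byte IPv4 and TCP headers. It is not Check Point’s implementation, and the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Assumed header sizes: IPv4 and TCP, both without options.
const int kIpv4Bytes = 20;
const int kTcpBytes = 20;

// Returns the MSS to pass along: the advertised value, unless it would
// produce packets bigger than the path MTU allows.
inline int clampMss(int advertisedMss, int pathMtu) {
  assert(pathMtu > kIpv4Bytes + kTcpBytes);
  return std::min(advertisedMss, pathMtu - kIpv4Bytes - kTcpBytes);
}
```

For the MTU of 1420 mentioned above, a typical advertised MSS of 1460 would be clamped to 1380, while an already small MSS is left alone.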

Conclusion
Having preserved IPs during a server move can be a great benefit and OTV permits it. But you’d better have a talented staff to overcome the hurdles that will accompany this advanced technology.

Categories
TCP/IP

C++ TCP Socket Program

Intro
I was looking around for a sample TCP socket program written in C++ that might make working with TCP sockets less mysterious. I expected to find a flood of things to pick from, but that really wasn’t the case.

The Details
OK, I only looked for a few minutes, to be honest. The one I did settle on seems adequate. It’s sufficiently old, however, that it doesn’t actually work as-is. Probably if it did I wouldn’t even mention it. So I thought it was worth repeating here, with some tiny semantic updates.

What I used is from this web page: http://cs.baylor.edu/~donahoo/practical/CSockets/practical/. I was really only interested in the TCP echo client. It’s a good stand-in for any TCP client, I think.

Here’s TCPEchoClient.cpp:

/*
 *   C++ sockets on Unix and Windows
 *   Copyright (C) 2002
 *
 *   This program is free software; you can redistribute it and/or modify
 *   it under the terms of the GNU General Public License as published by
 *   the Free Software Foundation; either version 2 of the License, or
 *   (at your option) any later version.
 *
 *   This program is distributed in the hope that it will be useful,
 *   but WITHOUT ANY WARRANTY; without even the implied warranty of
 *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *   GNU General Public License for more details.
 *
 *   You should have received a copy of the GNU General Public License
 *   along with this program; if not, write to the Free Software
 *   Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 */
 
// taken from http://cs.baylor.edu/~donahoo/practical/CSockets/practical/TCPEchoClient.cpp
 
#include "PracticalSocket.h"  // For Socket and SocketException
#include <iostream>           // For cerr and cout
#include <cstdlib>            // For atoi()
#include <cstring>            // author forgot this
 
using namespace std;
 
const int RCVBUFSIZE = 32;    // Size of receive buffer
 
int main(int argc, char *argv[]) {
  if ((argc < 3) || (argc > 4)) {     // Test for correct number of arguments
    cerr << "Usage: " << argv[0]
         << " <Server> <Echo String> [<Server Port>]" << endl;
    exit(1);
  }
 
  string servAddress = argv[1]; // First arg: server address
  char *echoString = argv[2];   // Second arg: string to echo
// DrJ test
//  echoString = "GET / HTTP/1.0\n\n";
  int echoStringLen = strlen(echoString);   // Determine input length
  unsigned short echoServPort = (argc == 4) ? atoi(argv[3]) : 7;
 
  try {
    // Establish connection with the echo server
    TCPSocket sock(servAddress, echoServPort);
 
    // Send the string to the echo server
    sock.send(echoString, echoStringLen);
 
    char echoBuffer[RCVBUFSIZE + 1];    // Buffer for echo string + \0
    int bytesReceived = 0;              // Bytes read on each recv()
    int totalBytesReceived = 0;         // Total bytes read
    // Receive the same string back from the server
    cout << "Received: ";               // Setup to print the echoed string
    while (totalBytesReceived < echoStringLen) {
      // Receive up to the buffer size bytes from the sender
      if ((bytesReceived = (sock.recv(echoBuffer, RCVBUFSIZE))) <= 0) {
        cerr << "Unable to read";
        exit(1);
      }
      totalBytesReceived += bytesReceived;     // Keep tally of total bytes
      echoBuffer[bytesReceived] = '\0';        // Terminate the string!
      cout << echoBuffer;                      // Print the echo buffer
    }
    cout << endl;
 
    // Destructor closes the socket
 
  } catch(SocketException &e) {
    cerr << e.what() << endl;
    exit(1);
  }
 
  return 0;
}

Note the cstring header file I needed to include. It declares strlen(), which the code uses. My guess is that older compilers pulled it in indirectly through other headers, so the original code built without it.

Then I needed PracticalSocket.h, but that has no changes from the original version: http://cs.baylor.edu/~donahoo/practical/CSockets/practical/PracticalSocket.h, and his Makefile is also just fine: http://cs.baylor.edu/~donahoo/practical/CSockets/practical/Makefile. For the fun of it I also set up the TCP Echo Server: http://cs.baylor.edu/~donahoo/practical/CSockets/practical/TCPEchoServer.cpp.

Run

make TCPEchoClient

and you should be good to go. How to test this TCPEchoClient against your web server? I found that the following works:

~/TCPEchoClient drjohnstechtalk.com 'GET / HTTP/1.0
Host: drjohnstechtalk.com
 
' 80

which gives this output:

Received: HTTP/1.1 301 Moved Permanently
Date: Thu, 23 Feb 2012 17:19:02

which, now that I analyze it, looks cut-off. Hmm. Because with curl I have:

curl -i drjohnstechtalk.com

HTTP/1.1 301 Moved Permanently
Date: Thu, 23 Feb 2012 17:19:43 GMT
Server: Apache/2.2.16 (Ubuntu)
X-Powered-By: PHP/5.3.3-1ubuntu9.5
Location: http://www.drjohnstechtalk.com/blog/
Vary: Accept-Encoding
Content-Length: 2
Content-Type: text/html

I guess that’s what you get for demo code. At this point I don’t have a need to sort it out so I won’t. Perhaps we’ll come back to it later. Looking at it, I see the receive buffer size is quite small, 32 bytes. I tried to set that to a reasonable value, 200 MBytes, but get a segmentation fault. The largest I could manage, after experimentation, is 10000000 bytes:

//const int RCVBUFSIZE = 32;    // Size of receive buffer
const int RCVBUFSIZE = 10000000;    // Size of receive buffer - why is 10 MB the max. value??

and this does indeed give us the complete output from our web server home page now.
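
On closer reading, the cut-off looks like a consequence of the loop condition rather than the buffer size. The client stops once it has received at least as many bytes as it sent, which is right for an echo server but not for a web server. With a 32-byte buffer that appears to mean two reads, and 64 bytes is about what the truncated output above shows. The segmentation fault is probably because echoBuffer is a local array on the stack, and stack size is limited (ulimit -s shows the limit on Linux). Here is a minimal sketch of an alternative that keeps a small buffer and simply reads until the server closes the connection. The helper readUntilClosed and the RecvFn type are hypothetical, not part of PracticalSocket. The sketch assumes the receive call returns 0 once the peer has closed, as the underlying recv() does.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>

// recvFn(buffer, maxBytes) stands in for sock.recv(buffer, maxBytes).
// It returns the number of bytes read, 0 once the peer has closed the
// connection, or a negative value on error.
using RecvFn = std::function<int(char*, int)>;

// Collects everything the peer sends until it closes the connection.
std::string readUntilClosed(const RecvFn& recvFn, int bufSize = 32) {
  assert(bufSize > 0);
  std::string response;
  // std::string keeps its storage on the heap, so a large bufSize
  // does not eat into the stack.
  std::string buffer(static_cast<std::size_t>(bufSize), '\0');
  for (;;) {
    const int n = recvFn(&buffer[0], bufSize);
    if (n < 0) {
      throw std::runtime_error("Unable to read");
    }
    if (n == 0) {
      break;  // peer closed the connection, we have it all
    }
    response.append(buffer, 0, static_cast<std::size_t>(n));
  }
  return response;
}
```

In the echo client this could presumably replace the while loop with something like readUntilClosed([&](char *b, int n) { return sock.recv(b, n); }). Since the request is HTTP/1.0, the server should close the connection after the response, which is what ends the loop.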

Conclusion
There is some demo C++ code which creates a useable class for dealing with TCP sockets. There might be some work to do before it could be used in a serious application, however.