Categories
Network Technologies TCP/IP

Spin up your own VPN with OpenVPN

Intro
I recently visited a foreign country where I was unable to watch an Amazon Prime Original show because of my location. Annoyed, I decided then and there to investigate OpenVPN. I am an ideal candidate – I already run my own Linux server in the Amazon cloud, and I know Linux and networking, so I’ve pretty much got all the ingredients already present. The software is free and I will incur no additional cost if I ever do get it working since I already pay for my server which is primarily used as a low-demand web server. Little did I know what I was getting myself into!

The details
I seem to have made every mistake in the book, but I have a strong grasp of the networking involved so I was confident in my general approach. Seeing how configurable the software is, I decided to ramp up my efforts incrementally, introducing more and more features until I arrived at the minimum desired setup (still not there, by the way, but getting close!).

The package on various OSes
Conveniently, openvpn is an installable package on SLES and CentOS. On SLES a zypper install openvpn suffices, whereas on CentOS a yum install openvpn does the trick. On Debian-based Linux such as Raspbian, an apt-get install openvpn works to install it.

Then look at the examples at the bottom of the man page for openvpn. I picked two servers where I have root and tried to replicate their most basic example. I think it's really important to crawl before you run. To paraphrase it:

Example 1: A simple tunnel without security
       On may:
 
              openvpn --remote june.kg --dev tun1 --ifconfig 10.4.0.1 10.4.0.2 --verb 9
 
       On june:
 
              openvpn --remote may.kg --dev tun1 --ifconfig 10.4.0.2 10.4.0.1 --verb 9
 
       Now verify the tunnel is working by pinging across the tunnel.
 
       On may:
 
              ping 10.4.0.2
 
       On june:
 
              ping 10.4.0.1

june.kg and may.kg are the IP addresses or FQDNs of the june and may servers, respectively.

If you installed from a package you probably don’t need these preliminary steps they mention:

              mknod /dev/net/tun c 10 200
 
              modprobe tun

I checked it by a directory listing of /dev/net/tun:

crw-rw-rw- 1 root root 10, 200 Jan  6 08:35 /dev/net/tun

and running modprobe tun for good measure.

And, yes, their basic example worked. How? The command creates a virtual adapter, tun1, specially designed for tunnels, which records the private tunnel IP of the server itself as well as the IP of the tunnel endpoint on the other server.

It’s so cool. What’s not well explained is that you ought to choose completely private IPs for building your tunnels that don’t interfere with your other private IPs. You’ll see I eventually settled on 172.27.28.29 – I’ve never seen that range used anywhere.

When you quit the openvpn command the tun1 adapter is destroyed. It's very tidy.

A small word about routing for now
What about routing? How does that work? What you probably didn't appreciate is that in this simple example, although you can ping each other using the tunnel IPs, you really can't do any more than that unless you start to introduce additional routes. So as is it's a long way from where we need to be. More on routing later. Check your route table with a netstat -rn.
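For example, if there were a LAN behind june that you wanted to reach from may, you would add a route for that network on may pointing at june's tunnel IP (and enable IP forwarding on june). A rough sketch, using a made-up 192.168.50.0/24 network behind june:

# on may: send traffic for the hypothetical LAN behind june into the tunnel
route add -net 192.168.50.0 netmask 255.255.255.0 gw 10.4.0.2
# on june: permit forwarding so it passes that traffic along
echo 1 > /proc/sys/net/ipv4/ip_forward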

Finding the right example is surprisingly hard
For such a popular program you’d think examples of what I’m trying to do – change the effective IP of my laptop – would abound and implementation would be a piece of cake. But alas, nothing could be further from the truth. I’ve yet to see a complete example so I cobbled together things from various places.

You gotta understand that a lot of the people with enough tech-savvy to write up what they did were, I guess, just tickled pink to be able to tunnel through their home router and access their home networks. That's fine and all, but it's not all that pertinent to my usage, where the routing considerations are pretty different.

Anywho, this article is really helpful, though not sufficient by itself:

https://openvpn.net/index.php/open-source/documentation/miscellaneous/78-static-key-mini-howto.html

That describes a single server/client setup with simple security.

Server config
After a lot of debugging and incremental steps, I’m currently using this file with filepath /etc/openvpn/server.conf on my Amazon AWS server:

# -DrJ 1/5/16
# simple one server/one client setup...
# https://openvpn.net/index.php/open-source/documentation/miscellaneous/78-static-key-mini-howto.html
# static.key generated with openvpn --genkey --secret static.key
# iptables NAT discussion: http://www.karlrupp.net/en/computer/nat_tutorial
# using (simple udp test worked): iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
# list: iptables -t nat -L -v
# dig on RasPi: from dnsutils
dev tun
# 1194 is default, but...
port 2096
ifconfig 172.27.28.29 172.27.28.30
secret static.key
# use compression
comp-lzo
# resist failures
keepalive 10 60
ping-timer-rem
persist-tun
persist-key
# run as daemon
user nobody
group nobody
daemon
proto tcp-server
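To bring it up for testing I just pointed openvpn at the config file from the command line (running it as a proper service is tidier in the long run). The static.key referenced above was generated as per the comment, and since the key path is relative you want to run from /etc/openvpn. Something like this, and then copy static.key to the client by some secure means:

$ cd /etc/openvpn
$ sudo openvpn --genkey --secret static.key
$ sudo openvpn --config server.conf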

An experimental client config file currently looks like this:

# for understanding what openvpn can do...
# simple single server/client setup from https://openvpn.net/index.php/open-source/documentation/miscellaneous/78-static-key-mini-howto.html
# - DrJ 1/6/16
remote drjohnstechtalk.com
# 1194 is default but I'll use this one...
port 2096
dev tun
ifconfig 172.27.28.30 172.27.28.29
secret static.key
# resist failures
keepalive 10 60
ping-timer-rem
persist-tun
persist-key
# use compression
comp-lzo
# for testing
proto tcp-client
# a test host route
#route 172.16.0.23
# a test network route which includes endpoint
route 172.16.0.0 255.240.0.0
# stands for broad, Internet route which may overlap our route to the openvpn server: doesn't work by itself!
route drjohnstechtalk.com 255.255.255.255 net_gateway
route 50.17.188.0 255.255.255.0

Works through proxy
Amazingly, I can confirm openvpn works through a standard http proxy. This is both cool and a little scary. To the above config file I added something like this:

# use proxy which requires basic authentication
http-proxy proxy-1.johnstechtalk.com 8080 authfile

where authfile is in the same directory as client.conf (/etc/openvpn) and has the proxy username and password on two separate lines.
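In other words authfile is just a two-line text file, username on the first line and password on the second. With made-up credentials it would look like:

proxyuser
proxypassword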

You have to run your server in TCP mode in order to access it from a client that's behind an http proxy, however – hence the proto tcp-server line in the server config and the proto tcp-client line in the client config. I'll experiment to see if that's costing performance.

A little more on routing
Why the circumspect routes? I’m deathly afraid of locking myself out of these servers by implementing the wrong routes! And I learned a lot bootstrapping my way to broader routes. The thing is, I noticed my AWS server uses this private IP address for its DNS server: 172.16.0.23 (cat /etc/resolv.conf, for example). So I realized that I could introduce a host route to that IP on the openvpn client to begin the arduous process of building up and testing real routing. How to test? Simple. With commands like

> dig ns johnstechtalk.com @172.16.0.23

on the client.

Dig is a useful networking tool. I describe installing it on Windows systems in this post.

Point of concern over performance
Even before expanding the routes I am concerned about TCP performance. I noticed that when I add the +tcp option to dig to force the query to use TCP I get a big performance penalty. A regular query that takes 60 msec to the AWS DNS server from my PC takes anywhere from 200 – 400 msec over TCP! Granted, TCP involves a bunch more packets back and forth, but even allowing for that there is a disconcerting performance hit of 100 msec or so. By contrast the enterprise-class VPN provided by Juniper suffers from no such TCP performance penalty – I know because I tested dig over a Juniper VPN using the JunOS Pulse client.
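An easy way to see the difference for yourself: dig reports its own query time, so you can compare UDP and TCP directly:

$ dig ns johnstechtalk.com @172.16.0.23 | grep 'Query time'
$ dig +tcp ns johnstechtalk.com @172.16.0.23 | grep 'Query time'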

Up our game, try to run as a full server w/ DHCP and everything
I’ll show the config file below. Let me just mention that my first attempt didn’t work because I continued to use a shared secret (the secret declaration), but that is incompatible with a new statement I introduced on the server:

server 172.27.28.32 255.255.255.224

So I gotta bite the bullet and get my PKI infrastructure up and running. In the old days they bundled easy-rsa with openvpn. Now you have to install it separately. I was able to do a yum install easy-rsa on my CentOS instance.

Hey, easy-rsa is pretty cool – I might be able to use that for other things. I always wanted to know how to create my own CA! The instructions in the Howto are so complete that there’s really no need to go over that here. HowTo link.

I copied my /usr/share/easy-rsa/2.0/* files to /etc/openvpn/rsa and did all the PKI-building work there. But I didn't want to go crazy with encryption so I downgraded Diffie-Hellman from 2048 to 1024 bits:

$ openssl dhparam -out dh1024.pem 1024
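The rest of the PKI work was the standard easy-rsa 2.x sequence, roughly the following after editing the vars file with your own defaults (the HowTo is the authoritative reference):

$ cd /etc/openvpn/rsa
$ source ./vars
$ ./clean-all
$ ./build-ca
$ ./build-key-server server
$ ./build-key client1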

Full Monty – doing a full VPN
Enough warm-up. I tried full VPN and for some reason hit the right configuration on the first try.

Here’s my server.conf file on my Amazon AWS server.

# -DrJ 1/13/16
# multi-client server from https://openvpn.net/index.php/open-source/documentation/howto.html#examples
# static.key generated with openvpn --genkey --secret static.key
# iptables NAT discussion: http://www.karlrupp.net/en/computer/nat_tutorial
# using (simple udp test worked): iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
# list: iptables -t nat -L -v
port 1194
proto udp
dev tun
ca rsa/keys/ca.crt
cert rsa/keys/server.crt
key rsa/keys/server.key  # This file should be kept secret
dh rsa/keys/dh1024.pem
# pool of addresses for clients and server. server gets 172.27.28.33
server 172.27.28.32 255.255.255.224
ifconfig-pool-persist ipp.txt
# very experimental
push "redirect-gateway def1 bypass-dhcp"
# just use closest Google DNS server
push "dhcp-option DNS 8.8.8.8"
# resist failures
keepalive 10 60
ping-timer-rem
# use compression
comp-lzo
# only allow two clients max for now
max-clients 2
user nobody
group nobody
persist-key
persist-tun
status openvpn-status.log
# verbosity level. 0 - all but fatal errors, 9 - extremely verbose
verb 3

And here’s my Windows 10 client file.

# from https://openvpn.net/index.php/open-source/documentation/howto.html#examples
# - DrJ 1/17/16
client
dev tun
proto udp
remote drjohnstechtalk.com 1194
nobind
# resist failures
keepalive 10 60
ping-timer-rem
persist-tun
persist-key
# use compression
comp-lzo
# SSL/TLS params
ca ca.crt
cert client1.crt
key client1.key
# stands for broad, Internet route which may overlap our route to the openvpn server: doesn't work by itself!
route drjohnstechtalk.com 255.255.255.255 net_gateway
verb 3

I was a bit concerned the DHCP stuff might not work, but it did, including assigning a Google DNS server at 8.8.8.8.

I generated the client1.crt and client1.key using easy-rsa. These plus ca.crt I copied down to my laptop. For security I then deleted client1.key on my server (so no one else can grab it).

Performance
Performance is amazing. There is no longer a penalty for doing dig over TCP. speedtest.net actually shows better results when I am connected over my VPN! This is a totally unexpected surprise. I guess that's due to the compression, using UDP and not using overly strong encryption.

Speedtest results
speedtest.net results without VPN, i.e., regular mode: 8.0 Mbps download, 1.0 Mbps upload. With VPN I got 11.0 Mbps download and 1.2 Mbps upload.

A platform that didn’t work
You’ll see from my other blogs that I am a Raspberry Pi fan, so naturally I wanted to bring up a VPN client on the Raspberry Pi. I am stuck on this point however because, unlike the SLES Linux I tested with, Raspbian brings up a tun interface but then drops the wlan0 IP address, so all communication to it is lost. Maybe it would work better wired – I’ll try it and post the results.

Still more about routing
I never understood how the openvpn examples were supposed to work until I had to implement them. Indeed they generally won't work until you address some routing and NAT issues. There is no magic. I did traces which showed my client was trying to communicate using its assigned private IP of 172.27.28.29 as the source address. Well that's never going to work on the public Internet. You have to make sure routing is enabled on your server:

On Linux, use the command:
$ echo 1 > /proc/sys/net/ipv4/ip_forward
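Note that the echo only lasts until the next reboot. To make it stick, the usual approach is a sysctl setting, something like:

$ sysctl -w net.ipv4.ip_forward=1
$ echo 'net.ipv4.ip_forward = 1' >> /etc/sysctl.conf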

But that’s not sufficient either. That doesn’t address that the packets from your client have the wrong IP address as far as the outside world is concerned. So you have to NAT (address translate) those packets. I don’t like iptables but this turned out to be easier than expected. As mentioned in my server config file you run:

$ iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

This is the boiled down summary from this helpful article.

Don’t forget that to look at your NAT rules you need a command like this:

$ iptables -t nat -L -v

not simply an iptables -L.
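Also bear in mind that the MASQUERADE rule added this way won't survive a reboot. On CentOS, assuming the classic iptables service is in use rather than firewalld, something like this saves it:

$ service iptables save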

Note that this is a hide NAT – I am not making the openvpn client visible as a server on the Internet. That’s a lot harder with IP addresses being in scarce supply.

Does it work to solve my original problem? Well I’m not in a foreign country to run that test, but I can vouch that I can run Amazon Prime through openvpn. So I expect it will work overseas as well.

It works so well, how do I know I’m really running through openvpn?

Many ways. For instance my routing table. From a CMD window:

C:> netstat -rn

IPv4 Route Table
===========================================================================
Active Routes:
Network Destination        Netmask          Gateway       Interface  Metric
          0.0.0.0          0.0.0.0      192.168.2.1      192.168.2.8     25
          0.0.0.0        128.0.0.0     172.27.28.37     172.27.28.38     20
    50.17.188.196  255.255.255.255      192.168.2.1      192.168.2.8     25
        127.0.0.0        255.0.0.0         On-link         127.0.0.1    306
        127.0.0.1  255.255.255.255         On-link         127.0.0.1    306
  127.255.255.255  255.255.255.255         On-link         127.0.0.1    306
        128.0.0.0        128.0.0.0     172.27.28.37     172.27.28.38     20
      169.254.0.0      255.255.0.0         On-link       192.168.2.8    306
  169.254.255.255  255.255.255.255         On-link       192.168.2.8    281
     172.27.28.33  255.255.255.255     172.27.28.37     172.27.28.38     20
     172.27.28.36  255.255.255.252         On-link      172.27.28.38    276
     172.27.28.38  255.255.255.255         On-link      172.27.28.38    276
     172.27.28.39  255.255.255.255         On-link      172.27.28.38    276
      192.168.2.0    255.255.255.0         On-link       192.168.2.8    281
      192.168.2.8  255.255.255.255         On-link       192.168.2.8    281
    192.168.2.255  255.255.255.255         On-link       192.168.2.8    281
        224.0.0.0        240.0.0.0         On-link         127.0.0.1    306
        224.0.0.0        240.0.0.0         On-link       192.168.2.8    281
        224.0.0.0        240.0.0.0         On-link      172.27.28.38    276
  255.255.255.255  255.255.255.255         On-link         127.0.0.1    306
  255.255.255.255  255.255.255.255         On-link       192.168.2.8    281
  255.255.255.255  255.255.255.255         On-link      172.27.28.38    276

Specifically note the two routes to 0.0.0.0 netmask 128.0.0.0 and 128.0.0.0 netmask 128.0.0.0 via gateway 172.27.28.37. That’s just what this experimental command in my server config file was supposed to do:

push "redirect-gateway def1 bypass-dhcp"

namely, create two broad routes to the entire Internet, just slightly more specific than a default route so they take precedence over the default route – I think it’s a clever idea. And of course ipconfig shows my private IP address:

Ethernet adapter Ethernet 2:
 
   Connection-specific DNS Suffix  . :
   Link-local IPv6 Address . . . . . : fe80::8871:c910:185e:953b%5
   IPv4 Address. . . . . . . . . . . : 172.27.28.38
   Subnet Mask . . . . . . . . . . . : 255.255.255.252
   Default Gateway . . . . . . . . . :

References and related
No patience to roll your own? Five top commercial VPN offerings are described here: http://www.itworld.com/article/3152904/security/top-5-vpn-services-for-personal-privacy-and-security.html
A lightweight dig installation method for Windows is described here.
My article about iptables.

Categories
Admin Linux

Narrowing down answer to NPR puzzle with Linux commands

Intro
This is for CentOS and RedHat Linux.

Narrow things down
$ egrep '^[a-z]{6}$' /usr/share/dict/linux.words | sed 's/.//' | sort | uniq -c | sort -k1 -d -r > 6-ltr-last-5

This is a great string of commands to study if you want to unleash the power of the Linux shell. You have a matching operator, egrep, a simple regular expression, a substitution command, sed, a sort command, sort, a duplicate-counting command, uniq -c, and a final sort ordered by count and displayed in reverse order. I issue commands like this frequently against log files and can do much more important work than solving an NPR puzzle.

6-ltr-last-5 starts like this:

     14 itter
     14 ingle
     14 atter
     14 agged
     13 etter
     13 ester
     13 aster
     13 apper
     13 apped
     13 agger
...

It has 772 lines with 4 or greater occurrences – too many to process by hand.

Edit this file and only keep the top part of the file up until the last of the 4 occurrences.

Now go back and match these words against the dictionary.

$ cat 6-ltr-last-5 | awk '{print $2}' | while read line; do egrep '^[a-z]'$line$ /usr/share/dict/linux.words >> 6-ltr-combos; echo " " >> 6-ltr-combos; done

6-ltr-combos starts like this:

bitter
fitter
gitter
hitter
jitter
kitter
litter
nitter
pitter
ritter
sitter
titter
witter
zitter
 
bingle
cingle
dingle
gingle
hingle
jingle
...

Small program to process that file
OK, to work with that file we just created based on the logic of the problem statement, I created this custom perl script which I call 6-5.pl:

#!/usr/bin/perl
$DEBUG = 0;
$consonants = 'bcdfghjlmnpqrstvwxyz';
$oldplace = -1;
$pot = 0;
while(<STDIN>){
  if (/^\s/) {
    print "pot,start word = $pot, $startword\n" if $pot > 3;
# reset some  values
    $oldplace = -1;
    $pot = 0;
    $startword = $_;
  }
  chomp;
# get at first character
  ($char) = /^(\w)/;
# turn character into position number with this
  $place = index $consonants,$char;
  print "word,place: $_,$place\n" if $DEBUG;
  if ($place != $oldplace + 1) {
# clear things out
    print "pot,start word = $pot, $startword\n" if $pot > 3;
    $pot  = 1;
    $startword = $_;
  } else {
    $pot++;
  }
  print "pot: $pot\n" if $DEBUG;
  $oldplace = $place;
}

I really wish I knew Python – I bet it would be an even shorter script in that language. But this gets the job done. It's warts and all – I've done just enough debugging to get it to return mostly reasonable output, though it's still not quite right. It's good enough…

Run it:

$ ./6-5.pl < 6-ltr-combos

pot,start word = 4, fitter
pot,start word = 7, gingle
pot,start word = 4, latter
pot,start word = 8, dagged
pot,start word = 5, fetter
pot,start word = 5, jester
pot,start word = 6, dagger
...

The biggest problem is that my dictionary contains too many uncommon words, but at least that guarantees that the answer will indeed be present. And it is. In fact I found three sets of what I consider common words. One set are very ordinary words so I guess that is the intended answer. I can’t give away everything right now – you’ll have to do some work! I’ll post the answers after Sunday.

References and related
A similar approach to a previous puzzle is here.

Categories
Network Technologies Python

Tips on using scapy for custom IP packets

Intro
scapy is an IP packet customization tool that keeps coming up in my searches so I could no longer avoid it. I was unnecessarily intimidated because it was built around python and the documentation is a little strange. But I’m warming up to it now…

The details
Download and install
CentOS
Just go to scapy.net and it will offer you the .zip file to download. I got scapy-2.3.1.zip. Then unzip it, change directory to the scapy-2.3.1 sub-directory and run

$ sudo python setup.py install

Debian systems such as Raspberry Pi
Simple. It’s just:

$ sudo apt-get install python-scapy

Usage modes
scapy can be called from within python, but if you’re afraid to do that like I am, you can run it from the command line which simply throws you into a python shell. I’m finding that a lot more comfortable as I slowly learn python syntax and some useful shortcuts.

Example 1
The background
Let’s cut to the chase and do something hard first. Remember how we got those Cisco Jabber packets with DSCP set, causing Cisco Jabber to not work for some users? The long-term solution according to that post is to turn off the DSCP flag for all packets on the Internet router. So we want to be able to generate packets under our control with that flag set so we can see if we’ve managed to turn it off correctly.

DSCP value occupies the first 6 bits of the 8-bit tos field. The packets we got from Cisco had DSCP of 0x2e which is Expedited Forwarding (EF), and if you do the math that corresponds to tos of 0xb8 which in decimal is 184.
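To double-check that math: DSCP occupies the upper six bits of the tos byte, so shift the DSCP value left by two bits. A quick sanity check with shell arithmetic:

$ printf '0x%x = %d\n' $((0x2e << 2)) $((0x2e << 2))
0xb8 = 184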

$ sudo scapy
>>> sr(IP(dst="50.17.188.196",tos=184)/TCP(dport=80,sport=4025))

Begin emission:
....Finished to send 1 packets.
.*
Received 6 packets, got 1 answers, remaining 0 packets
(<Results: TCP:1 UDP:0 ICMP:0 Other:0>, <Unanswered: TCP:0 UDP:0 ICMP:0 Other:0>)
>>>

Instead of the call to sr you can simply use send. Breaking this down, I’m testing against my drjohns server with IP 50.17.188.196. tos is a property of an IP packet so it’s included as a keyword argument to the IP function. The "/" following the IP function is funny syntax, but it stacks the next protocol layer – here TCP – on top of the IP layer. So in the TCP section I used keyword arguments and set a source port of 4025 and a destination port of 80. What I observed is that this sends a SYN packet even though I didn’t explicitly say so.

Want to have a random source port like “real” packets? Then use this:

$ >>> sr(IP(dst="50.17.188.196",tos=184)/TCP(dport=80,sport=RandShort()))

Look for it
I know tcpdump better so I look for my packet with that tool like this:

$ sudo tcpdump -v -n -i eth0 host 71.2.39.115 and port 80

tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
19:33:29.749170 IP (tos 0xb8, ttl 39, id 1, offset 0, flags [none], proto TCP (6), length 44)
    71.2.39.115.partimage > 10.185.21.116.http: Flags [S], cksum 0xd97b (correct), seq 0, win 8192, options [mss 1460], length 0
19:33:29.749217 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 44)
    10.185.21.116.http > 71.2.39.115.partimage: Flags [S.], cksum 0x3e39 (correct), seq 3026513916, ack 1, win 5840, options [mss 1460], length 0
19:33:29.781193 IP (tos 0x0, ttl 41, id 19578, offset 0, flags [DF], proto TCP (6), length 40)

Interpretation
Our tos was wiped clean by the time our generated packet was received by Amazon AWS. This was a packet I sent from my home using my Raspberry Pi. So likely my ISP CenturyLink is removing QOS from packets its residential customers send out. With some ISPs and business class service I have seen the tos field preserved exactly. When sent from Amazon AWS I saw the field value altered, but not set to 0!

Example 2, ping
>>> sr(IP(dst="8.8.8.8")/ICMP())

Begin emission:
Finished to send 1 packets.
.*
Received 2 packets, got 1 answers, remaining 0 packets
(<Results: TCP:0 UDP:0 ICMP:1 Other:0>, <Unanswered: TCP:0 UDP:0 ICMP:0 Other:0>)

Getting info on return packet
$ >>> sr1(IP(dst="drjohnstechtalk.com",tos=184)/TCP(dport=80,sport=RandShort()))

Begin emission:
...............................................................................................Finished to send 1 packets.
...........................................*
Received 139 packets, got 1 answers, remaining 0 packets
<IP  version=4L ihl=5L tos=0x0 len=44 id=0 flags=DF frag=0L ttl=25 proto=tcp chksum=0xe1d7 
src=50.17.188.196 dst=144.29.1.2 options=[] |<TCP  sport=http dport=17176 seq=3590570804 ack=1 dataofs=6L reserved=0L flags=SA window=5840 chksum=0x24b0 urgptr=0 options=[('MSS', 1460)] |<Padding  load='\x00\x00' |>>>

Note that this tells me about the return packet, which is a SYN ACK. So it tells me my SYN packet must have been sent from port 17176 (it changes every time because I’ve included sport=RandShort()). Each “.” in the response indicates a packet hitting the interface. I guess it’s promiscuously listening on the interface.

Hitting a closed port

$ >>> sr1(IP(dst="drjohnstechtalk.com",tos=184)/TCP(dport=81,sport=RandShort()))

Begin emission:
....................................................................................
.........................................................................................................Finished to send 1 packets.
...............................................................................
.................................................................................
.........................................................................................

Basically those dots are going to keep going forever until you type Ctrl-C, because there will be no return packet if something like a firewall is dropping your packet, or the returned packet.
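Rather than waiting on Ctrl-C you can pass sr1 a timeout in seconds, after which it gives up and returns nothing:

>>> sr1(IP(dst="drjohnstechtalk.com",tos=184)/TCP(dport=81,sport=RandShort()),timeout=5)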

Useful shortcuts
The scapy commands look pretty daunting at first, right? And too much trouble to type in, right? Just get it right once and you’re set. In typical networking debugging you’ll be running such test packets multiple times. Because it’s basically a python shell, you can use the up arrow key to recall the previous command, or hit it multiple times to scroll through your previously typed commands. And even if you exit and return, it still remembers your command history, so you can hit the up-arrow to get back to your commands from previous sessions and previous days.

References and related
This scapy for dummies guide is very well written.
I’m finding this python tutorial really helpful.
DSCP and explanation of Cisco Jabber not working is described here.
A simpler tool which is fine for most things is nmap. I provide some real-world examples in this blog post.

Categories
Admin Web Site Technologies

The IT Detective agency: Outlook client is Disconnected, all else fine

Intro
Today we were asked to consult on the following problem. Some proxy users at a large company could not connect to Microsoft Outlook. Only a few users were affected. Fix it.

The details
Affected users would bring up Outlook and within a few short seconds it would simply show Disconnected and stay that way.

It was quickly established that the affected users shared this in common: they use LDAP authentication and proxy-basic-authentication. The users who worked used NTLM authentication. The way they distinguish one from the other is by using a different proxy autoconfiguration (PAC) file.

More observations
Well, actually there was almost no difference whatsoever between the two PAC files. They are syntactically identical. The only difference in fact is that a different proxy is handed out for the NTLM users. That’s it!

We were able to reproduce the problem ourselves by using the same PAC file as the affected user. We tried to trace the traffic on our desktop but it was a complete mess. I did not see any connection to the designated proxy for Outlook traffic, but it’s hard to say definitively because there is so much other junk present. Strangely, all web sites worked OK and even the web-based version of Outlook works OK. So this Outlook client was the only known application having a problem.

When the affected users put in the proxy directly in manual proxy settings in IE and turned off proxy autoconfig, then Outlook worked. Strange.

We observed the header for the PAC file was a little bit inconsistent (it was being served from multiple web servers through a load balancer). The Content-Type MIME header was coming back as either text/plain or there was no such header at all, depending on which web server you were hitting. But note that the NTLM users were also getting PAC files with this same header.

The solution

Although everything had been fine with this header situation up until the introduction of Outlook, we guessed it was technically incorrect and should be fixed. We changed all web servers to have the PAC file be served with this MIME header:

Content-Type: application/x-ns-proxy-autoconfig
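How you arrange that depends on the web server in question. If it happened to be Apache, for example, one way would be a MIME-type mapping for the .pac extension:

AddType application/x-ns-proxy-autoconfig .pac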

The results

A re-test confirmed that this fixed the Outlook problem for the LDAP-affected users. NTLM users were not impacted and continued to work fine.

Conclusion
A strange Outlook connection problem was resolved in large company Intranet by adjusting the PAC file to include the correct content-type header. Case closed!

References and related information
Here’s a PAC file case we never did resolve: excessive calls to the PAC file web server from individual users.

Categories
Admin

SiteScope keeps restarting

Intro
I’m just documenting what the support tech had me do to fix this scary issue.

The details
This was a SiteScope v 11.24 instance running on a RHEL 6.6 VM.

2015-12-08 05:12:56,768 [SiteScope Main Thread] (SiteScopeSupport.java:721) ERROR - SiteScope unexpected shutdown
java.lang.reflect.UndeclaredThrowableException
        at com.sun.proxy.$Proxy106.postInit(Unknown Source)
        at com.mercury.sitescope.platform.configmanager.ConfigManager$PersistencyAdaptor.initPersistObjectsAfterLoad(ConfigManager.java:1967)
        at com.mercury.sitescope.platform.configmanager.ConfigManager$PersistencyAdaptor.initialize(ConfigManager.java:1247)
        at com.mercury.sitescope.platform.configmanager.ConfigManager$PersistencyAdaptor.access$300(ConfigManager.java:1112)
        at com.mercury.sitescope.platform.configmanager.ConfigManager.initialize(ConfigManager.java:145)
        at com.mercury.sitescope.platform.configmanager.ConfigManagerSession.initialize(ConfigManagerSession.java:153)
        at com.mercury.sitescope.bootstrap.SiteScopeSupport.initializeSiteScope(SiteScopeSupport.java:592)
        at com.mercury.sitescope.bootstrap.SiteScopeSupport.configureSiteScope(SiteScopeSupport.java:629)
        at com.mercury.sitescope.bootstrap.SiteScopeSupport.siteScopeMain(SiteScopeSupport.java:678)
        at com.mercury.sitescope.web.servlet.InitSiteScope$SiteScopeMainThread.run(InitSiteScope.java:233)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedMethodAccessor109.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.mercury.sitescope.platform.configmanager.ManagedObjectConfigRef$ManagedObjectProxyHandler.invoke(ManagedObjectConfigRef.java:290)
        ... 10 more
Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
        at com.mercury.sitescope.entities.monitors.MonitorGroup.readDynamic(MonitorGroup.java:445)
        at com.mercury.sitescope.entities.monitors.MonitorGroup.postInit(MonitorGroup.java:2001)
        ... 14 more
2015-12-08 05:12:56,776 [SiteScope Main Thread] (SiteScopeShutdown.java:51) INFO  - Shutting down SiteScope reason Exception: java.lang.reflect.UndeclaredThrowableException null...
2015-12-08 05:12:56,784 [ShutdownThread at Tue Dec 08 05:12:56 EST 2015] (SiteScopeGroup.java:1061) INFO  - Stopping dynamic counters flow...
2015-12-08 05:12:56,832 [ShutdownThread at Tue Dec 08 05:12:56 EST 2015] (SiteScopeGroup.java:1112) INFO  - Waiting 40 secs to allow monitors to complete.
2015-12-08 05:13:36,854 [ShutdownThread at Tue Dec 08 05:12:56 EST 2015] (SiteScopeGroup.java:1280) INFO  - Average Monitors Running: 0
2015-12-08 05:13:36,855 [ShutdownThread at Tue Dec 08 05:12:56 EST 2015] (SiteScopeGroup.java:1281) INFO  - Peak Monitors Per Minute: 0 at 7:00 pm 12/31/69
2015-12-08 05:13:36,855 [ShutdownThread at Tue Dec 08 05:12:56 EST 2015] (SiteScopeGroup.java:1283) INFO  - Peak Monitors Running: 0.0 at 7:00 pm 12/31/69
2015-12-08 05:13:36,855 [ShutdownThread at Tue Dec 08 05:12:56 EST 2015] (SiteScopeGroup.java:1284) INFO  - Peak Monitors Waiting: 0 at 7:00 pm 12/31/69

And this just kept happening and happening.

The solution
The support tech from HPE had me go into the groups directory and delete all files except those ending in .dyn and .config. These directories are under the /opt/HP/SiteScope directory on my installation.

In the persistency directory we deleted all files ending in .tmp. But we made saved copies of the entire original groups and persistency directories elsewhere just in case.

The results
HP SiteScope started just fine after that! A healthy SiteScope startup includes lines like these:

2015-12-08 08:59:12,149 [SiteScope Main] (SiteScopeGroup.java:995) INFO  - Open your web browser to:
2015-12-08 08:59:12,149 [SiteScope Main] (SiteScopeGroup.java:996) INFO  -   http://10.192.136.89:8080
2015-12-08 08:59:12,180 [SiteScope Main] (SiteScopeGroup.java:324) INFO  - Starting common scheduler...
2015-12-08 08:59:12,273 [SiteScope Main] (SiteScopeGroup.java:344) INFO  - Starting maintenance scheduler...
2015-12-08 08:59:12,381 [SiteScope Main] (SiteScopeGroup.java:609) INFO  - Starting topaz manager
2015-12-08 08:59:12,384 [SiteScope Main] (SiteScopeGroup.java:611) INFO  - Topaz manager started.
2015-12-08 08:59:12,386 [SiteScope Main] (SiteScopeGroup.java:457) INFO  - Starting monitor scheduler...
2015-12-08 08:59:12,386 [SiteScope Main] (SiteScopeGroup.java:462) INFO  - Starting report scheduler...
2015-12-08 08:59:12,398 [SiteScope Main] (SiteScopeGroup.java:477) INFO  - Starting analytics scheduler...
2015-12-08 08:59:12,398 [SiteScope Main] (SiteScopeGroup.java:488) INFO  -
2015-12-08 08:59:12,398 [SiteScope Main] (SiteScopeGroup.java:489) INFO  - SiteScope is active...
2015-12-08 08:59:12,493 [SiteScope Main] (SiteScopeGroup.java:513) INFO  - SiteScope Starting all monitors
2015-12-08 08:59:12,830 [SiteScope Main] (SiteScopeGroup.java:517) INFO  - SiteScope Start monitors completed
2015-12-08 08:59:13,037 [SiteScope Main] (SiteScopeGroup.java:532) INFO  - SiteScope Startup Completed
2015-12-08 08:59:13,215 [SiteScope Main] (SiteScopeSupport.java:713) INFO  - SiteScope 11.24.241  build 165 process started at Tue Dec 08 08:59:13 EST 2015
2015-12-08 08:59:13,215 [SiteScope Main] (SiteScopeSupport.java:714) INFO  - SiteScope Start took 26 sec

Especially that last line.

The hypothesis
As the server had crashed the hypothesis is that one of the files must have gotten corrupted.

Conclusion
HP SiteScope, which could not start, was fixed by removing some older files. These files are not needed, either – your configuration will not be lost when you delete them.

Categories
Linux Perl

Solution to NPR puzzle: capitals and cities

Problem statement
Take a US state capital. Drop a letter. Re-arrange the remaining letters to get the name of a US city. There are two answers.

I know Will Shortz says he makes up problems that are not programmable, but he goofed up this time, big time. This one is eminently programmable. I am not even a programmer, more a scripter. I don’t want to have an unfair advantage so I’m sharing my Perl scripts that solve this problem.

The setup
Find a Linux machine. A Raspberry Pi makes a great economical choice if you don’t have anything like that.

Scrape the 50 state capitals from a web page after arriving at a page (I used Wikipedia) that lists them. By scrape I mean copy-and-paste into a file in a terminal window of your Linux server.

Same for US cities. I went to a more obscure listing and picked up 666 of the largest cities from a listing of the top 1000.

Cleaning up the Capitals
Here’s a script for processing those capital cities. The input file contains lines like these first two:

Alabama         AL      1819    Montgomery      1846    155.4   2       205,764         374,536         Birmingham is the s
tate's largest city
Alaska  AK      1959    Juneau  1906    2716.7  3       31,275          Juneau is the largest capital by land area. Anchora
ge is the state's largest city.
...

I called it cap-fltr.pl

#!/usr/bin/perl
while(<STDIN>) {
  ($cap) = /\d{4}\s+(\D+)\s+\d{4}/;
  $cap = lc $cap;
  $cap =~ s/\s//g;
  print "$cap\n";
}

Create a clean listing of capitals like this:

$ ./cap-fltr.pl < capitals.orig > capitals

assuming, that is, that you dumped your paste buffer of scraped capitals and related information into a file called capitals.orig. So that creates a file called capitals looking like this:

montgomery
juneau
phoenix
littlerock
sacramento
denver
...

Cleaning our cities
My scraped cities file morecities.orig looks like this:

1       New York - city         New York        8,491,079       5.9%
2       Los Angeles - city      California      3,928,864       6.0%
3       Chicago - city  Illinois        2,722,389       -5.9%
4       Houston - city  Texas   2,239,558       13.2%
5       Philadelphia - city     Pennsylvania    1,560,297       3.0%
6       Phoenix - city  Arizona         1,537,058       15.8%
...

These can be cleaned up with this script, which I call morecities-fltr.pl:

#!/usr/bin/perl
while(<STDIN>) {
  ($city) = /^\d+\s+([^-]+)\s-/;
  $city = lc $city;
  $city =~ s/\s//g;
  if ($city =~ /st\./) {
    $form1 = $city;
    $form1 =~ s/\.//g;
    print "$form1\n";
  }
  $city =~ s/st\./saint/;
  print "$city\n";
}

Note that in this and the capitals filter I get rid of spaces and lowercase everything, and here I also treat st and saint as alternate forms, just in case. Those puzzle-masters can be tricky!

So the cleanup goes into a cities file with a command like this:

$ ./morecities-fltr.pl < morecities.orig > cities

Finally the program that solves the puzzle…
The solution script, which I called sol2.pl, isn’t too bad if you know a tiny bit of Perl. It is this:

#!/usr/bin/perl
$DEBUG = 0;
open(CAPITALS,"capitals");
open(CITIES,"cities");
@cities = <CITIES>;
@capitals = <CAPITALS>;
foreach $capital (@capitals) {
  chomp($capital);
  print "capital: $capital\n" if $DEBUG;
  $lenCap = length $capital;
  foreach $city (@cities) {
    chomp($city);
    print "city: $city\n" if $DEBUG;
    $lenCity = length $city;
    next unless $lenCap == $lenCity + 1;
# put the individual letters of the city into an array
    @citychars = split //, $city;
    $capitalcopy = $capital;
# the following would be a more crude way to do it, leaving us results to be picked over by hand
    #$capitalcopy =~ s/[$city]//g;
# this is the heart of it - eat the characters of the capital away, one-by-one, with the characters of the city
# this provides for an exact, no-thinking solution
    foreach $citychar (@citychars) {
      $capitalcopy =~ s/$citychar//;
    }
    print "capitalcopy: $capitalcopy\n" if $DEBUG;
# solution occurs when the copy of the capital is left with exactly one character
    print "capital, city: $capital, $city\n" if $capitalcopy =~ /^\w$/;
  }
}

With the comments it’s pretty self-documenting.

I’ll publish my answer after the deadline. You have to do some work after all!

Python approach
This hasn’t been thoroughly tested yet. A friend has suggested the following python code might work.

# assumes capitals and cities are lists of cleaned-up names,
# e.g. read in from the capitals and cities files created above
for capital in capitals:
    current_capital = capital.strip()

    for city in cities:
        current_city = city.strip()
        # the capital must be exactly one letter longer than the city
        if len(current_capital) != len(current_city) + 1:
            continue

        # cross off each letter of the capital against the letters of the city
        missed_letters = 0
        remaining = list(current_city)
        for letter in current_capital:
            if letter in remaining:
                remaining.remove(letter)
            else:
                missed_letters += 1

        # exactly one leftover letter means the city is the capital minus one letter
        if missed_letters == 1:
            print current_capital, "Take one letter from this capital"
            print current_city, "Rearrange the letters in the capital to get this city"

My answer
$ ./sol2.pl

capital, city: trenton, renton
capital, city: salem, mesa
capital, city: salem, ames

Actual answer
salem -> mesa
st paul -> tulsa

This illustrates a good lesson. Program, yes, but don’t turn off your brain. I was aware of possible saint/st spelling pitfalls, but I only applied it to the cities list. For some reason I overlooked applying this ambiguity to the capitals list! If I add stpaul to my capitals list the program finds tulsa instantly – I just tried it.

If there’s any satisfaction, very, very few people got the correct answer.

References and related
Audio clip of the puzzle
Another NPR puzzle amenable to simple computer tools, in this case shell commands, is documented here: puzzle
Raspberry Pis. Here we describe building a four-monitor display with Raspberry Pis.
A good python tutorial: https://docs.python.org/2/tutorial/

Categories
Admin Network Technologies Proxy Security TCP/IP Web Site Technologies

The IT Detective Agency: Cisco Jabber stopped working for some using WAN connections

Intro
This is probably the hardest case I’ve ever encountered. It’s so complicated many people needed to get involved to contribute to the solution.

Initial symptoms

It’s not easy to describe the problem while providing appropriate obfuscation. Over the course of a few days it came to light that in this particular large company for which I consult, many people in office locations connected via an MPLS network were no longer able to log in to Cisco Jabber. That’s Cisco’s offering for instant messaging. When it works, and is used in combination with Cisco IP phones, it’s pretty good – it has some nice features. This major problem was first reported November 17th.

Knee-jerk reactions
Networking problem? No. Network guys say their networks are running fine. They may be a tad overloaded but they are planning to route Internet over the secondary links so all will be good in a few days.
Proxy problem? Nope. proxy guys say their Bluecoat appliances are running fine and besides everyone else is working.
Application problem? Application owner doesn’t see anything out of the ordinary.
Desktop problem? Maybe but it’s unclear.

Methodology
So of the 50+ users affected I recognized two power users that I knew personally and focussed on them. Over the course of days I learned:
– problem only occurs for WAN (MPLS) users
– problem only occurs when using one particular proxy
– if a user tries to connect often enough, they may eventually get in
– users can get in if they use their VPN client
– users at HQ were not affected

The application owner helpfully pointed out the URL for the web-based version of Cisco Jabber: https://loginp.webexconnect.com/… Anyone with the problem also could not log in to this site.

So working with these power users who patiently put up with many test suggestions we learned:

– setting the PC’s MTU to a small value, say 512 up to 696 made it work. Higher than that it generally failed.
– yet pings of up to 1500 bytes went through OK.
– the trace from one guy’s PC showed all his packets re-transmitted. We still don’t understand that.
– It’s a mess of communications to try to understand these modern, encrypted applications
– even the simplest trace contained over 1000 lines which is tough when you don’t know what you’re looking for!
– the helpful networking guy from the telecom company – let’s call him “Regal” – worked with us but all the while declaring how it’s impossible that it’s a networking issue
– proxy logs didn’t show any particular problem, but then again they cannot look into SSL communication since it is encrypted
– disabling Kaspersky helped some people but not others
– a PC with the problem had no problem when put onto the Internet directly
– if one proxy associated with the problem forwarded the requests to another, then it begins to work
– Is the problem reproducible? Yes, about 99% of the time.
– Do other web sites work from this PC? Yes.

From previous posts you will know that at some point I will treat every problem as a potential networking problem and insist on a trace.

Biases going in
So my philosophy of problem solving, which had stood the test of time, is that either it’s a networking problem or it’s a problem on the PC. Best is if there’s a competition of ideas in debugging, so that the PC/application people seek to prove beyond a doubt it is a networking problem and the networking people likewise try to prove the problem occurs on the PC. Only later did I realize the bias in this approach and that a third possibility existed.

So I enthused: what we need is a non-company PC – preferably on the same hardware – at the same IP address to see if there’s a problem. Well, we couldn’t quite produce that, but one power user suggested using a VM. He just happened to have a VM environment on his PC and could spin up a Windows 7 Professional generic image! So we do that – it shows the problem. But at least the trace from it is a lot cleaner without all the overhead of the company packages’ communication.

The hard work
So we do the heavy lifting and take a trace on both his VM with the problem and the proxy server and sit down to compare the two. My hope was to find a dropped packet, blame the network and let those guys figure it out. And I found it. After the client hello (this is a part of the initial SSL protocol) the server responds with its server hello. That packet – a largeish packet of 1414 bytes – was not coming through to the client! It gets re-transmitted multiple times and none of the re-transmits gets through to the PC. Instead the PC receives a packet the proxy never sent it which indicates a fatal SSL error has occurred.

So I tell Regal: look, there’s a problem with these packets. Meanwhile Regal has just gotten a new PC and doesn’t even have Wireshark. Can you imagine such a world? It seems all he really has is his tongue and the ability to read a few emails. And he’s not convinced! He reasons, after all, that the network has no intelligent, application-level devices and certainly wouldn’t single out Jabber communication to be dropped while keeping everything else. I am no desktop expert so I admit that maybe some application on the PC could have done this to the packets, in effect admitting that packets could be intercepted and altered by the PC even before being recorded by Wireshark. After all, I repeated this mantra many times throughout:

This explanation xyz is unlikely, and in fact any explanation we can conceive of is unlikely, yet one of them will prove to be correct in the end.

Meanwhile the problem wasn’t going away so I kludged their proxy PAC file to send everyone using jabber to the one proxy where it worked for all.

So what we really needed was to create a span port on the switch where the PC was plugged in and connect a 2nd PC, with a port enabled in promiscuous mode, to receive that mirrored traffic. That’s quite a lot of setup, and we were almost there when our power user began to work so we couldn’t reproduce the problem. That was about Dec 1st. Then, a day later, our 2nd power user’s problem also went away and he could no longer reproduce it either.

10,000 foot view
What we had so far is a whole bunch of contradictory evidence. Network? Desktop? We still could not say due to the contradictions, the likes of which I’ve never witnessed.

An affiliate is affected and finds the problem
Meanwhile an affiliate began to see the problem and independently examined it. They made much faster progress than we did. Within a day they found the reason (suggested by their networking person from the telecom, who apparently is much better than ours): the server hello packet has the expedited forwarding (EF) flag set in the differentiated services code point (DSCP) section of the IP header.

Say what?
So I really got schooled on this one. I was saying: it has to be an application-aware "something" on the network or PC that is purposefully messing around with the SSL communication. That’s what the evidence screamed to me. So a PC-based firewall seemed a strong contender and that is how Regal was thinking.

So the affiliate explained it this way: the company uses QOS on their routers. Phone (VOIP) traffic gets priority and is the only application where the EF bit is expected to be set. VOIP packets are small, by the way. Regular applications like web sites should just use the default QOS. And according to Wikipedia, many organizations who do use QOS will impose thresholds on the EF packets such that if the traffic exceeds, say, 30% of link capacity they drop all packets with EF set that are over a certain size. OK, maybe it doesn’t say exactly that, but that is what I’ve come to understand happens. Which makes the dropping of these particular packets the correct behaviour as per the company’s own WAN contract and design. Imagine that!

Smoking gun no more
So now my smoking gun – blame it on the network for dropped packets – is turned on its head. Cisco has set this EF bit on its server hello response on the loginp.webexconnect.com web site. This is undesirable behaviour. It’s not a phone call, after all, which requires minimal jitter in packet timing.

So next time I did a trace I found that instead of EF flag being set, the AF (Assured Forwarding) flag was set. I suppose that will make handling more forgiving inside the company’s network, but I was told that even that was too much. Only default value of 0 should be set for the DSCP value. This is an open issue in Cisco’s hands now.

But at least this explains most observations.
– Small MTU worked? Yup, those packets are looked upon more favorably by the routers.
– One proxy worked, the other did not? Yup, they are in different data centers which have different bandwidth utilization. The one where it was not working has higher utilization.
– Only affected users are at WAN sites? Yup, probably only the WAN routers are enforcing QOS.
– Worked over VPN, even on a PC showing the problem? Yup – all VPN users use a LAN connection for their proxy settings.
– Fabricated SSL fatal error packet? I’m still not sure about that one – guess the router sent it as a courtesy after it decided to drop the server hello – just a guess.
– Problem fixed by shutting down Kaspersky? Nope, guess that was a red herring. Every problem has dead ends and red herrings, just a fact of life. And anyway that behaviour was not very consistent.
– Problem started November 17th? Yup, the affiliate just happened to have a baseline packet trace from November 2nd which showed that DSCP was not in use at that time. So Cisco definitely changed the behaviour of Cisco Jabber sometime in the intervening weeks.
– Other web sites worked, except this one? Yup, other web sites do not use the DSCP section of the IP header so it has the default value of 0.

Conclusion
Cisco has decided to remove the DSCP flag from these packets, which will fix everything. Perhaps EF was introduced in support of Cisco Jabber’s extended use as a soft phone??? Then this company may have some re-design of their QOS to take care of because I don’t see an easy solution. Dropping the MTU on the proxy to 512 seems pretty drastic and inefficient, though it would be possible. My reading of TCP is that nothing prevents QOS from being set on any sort of TCP packet even though there may be a gentleman’s agreement to not ordinarily do so in all except VOIP packets or a few other special classes. I don’t know. I’ve really never looked at QOS before this problem came along.

The company is wisely looking for a way to set all packets with DSCP = 0 on the Intranet, except of course those like VOIP where it is explicitly supposed to be used. This will be done on the Internet router. In Cisco IOS it is possible with a policy map and police setting where you can set set-dscp-transmit default. Apparently VPN and other things that may check the integrity of packets won’t mind the DSCP value being altered – it can happen anywhere along the route of the packet.

Boy applications these days are complicated! And those rare times when they go wrong really require a bunch of cooperating experts to figure things out. No one person holds all the expertise any longer.

My simplistic paradigm of it’s either the PC or the network had to make room for a new reality: it’s the web site in the cloud that did them in.

Could other web sites be similarly affected? Yes it certainly seems a possibility. So I now know to check for use of DSCP if a particular web site is not working, but all others are.

References and related
This Wikipedia article is a good description of DSCP: https://en.wikipedia.org/wiki/Differentiated_services

Categories
Admin CentOS Linux Security

The IT Detective Agency: WordPress login failure leads to discovery of ssh brute force attack

Intro
Yes, my WordPress instance never gave any problems for years. Then one day my usual username/password wouldn’t log me in! One thing led to another until I realized I was under an ssh brute force attack from Hong Kong. Then I implemented software that really helped the situation…

The details
Login failure

So like anyone would do, I double-checked that I was typing the password correctly. Once I convinced myself of that I went to an ssh session I had open to it. When all else fails restart, right? Except this is not Windows (I run CentOS) so there’s no real need to restart the server. There very rarely is.

Mysql fails to start
So I restarted mysql and the web server. I noticed mysql database wasn’t actually starting up. It couldn’t create a PID file or something – no space left on device.

No space on /
What? I never had that problem before. In an enterprise environment I’d have disk monitors and all that good stuff, but as a singleton user of Amazon AWS I suppose they could monitor and alert me to disk problems but they’d probably want to charge me for the privilege. So yeah, a df -k showed 0 bytes available on /. That’s never a good thing.

/var/log very large
So I ran a du -k from / and sent the output to /tmp/du-k so I could preview at my leisure. Fail. Nope, can’t do that because I can’t write to /tmp because it’s on the / partition in my simple-minded server configuration! OK. Just run du -k and scan results by eye… I see /var/log consumes about 3 GB out of 6 GB available which is more than I expected.

btmp is way too large
So I did an ls -l in /var/log and saw that btmp alone is 1.9 GB in size. What the heck is btmp? Some searches show it to be a log used to record failed ssh login attempts. What is it recording?

Disturbing contents of btmp
I learned that to read btmp you do a
> last -f btmp
The output is zillions of lines like these:

root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
...

I estimate roughly 3.7 login attempts per second. And it’s endless. So I consider it a brute force attack to gain root on my server. This estimate is based on extrapolating from a 10-minute interval by doing one of these:

> last -f btmp|grep 'Oct 26 14:5'|wc

and dividing the result by 10 min * 60 s/min.

First approach to stop it
I’m a networking guy at heart and remember, when you have a hammer all problems look like nails 😉 What is the network nail in this case? The attacker’s IP address of course. We can just make sure packets originating from that IP can’t get returned from my server, by doing one of these:

> route add -host 43.229.53.13 gw 127.0.0.1

Check it with one of these:

> netstat -rn

Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
43.229.53.13    127.0.0.1       255.255.255.255 UGH       0 0          0 lo
10.185.21.64    0.0.0.0         255.255.255.192 U         0 0          0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth0
0.0.0.0         10.185.21.65    0.0.0.0         UG        0 0          0 eth0

Then watch the btmp grow silent since now your server sends the reply packets to its loopback interface where they die.
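
If you ever want to lift the block, the undo is the same route command with del in place of add:

> route del -host 43.229.53.13 gw 127.0.0.1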

Short-lived satisfaction
But the pleasure and pats on your back will be short-lived as a new attack from a new IP will commence within the hour. And you can squelch that one, too, but it gets tiresome as you stay up all night keeping up with things.

Although it wouldn’t be too difficult to script the recipe above and automate it, I decided it might be easier still to find a package out there that does the job for me. And I did. It’s called

fail2ban

You can get it from the EPEL repository of CentOS, making it particularly easy to install. Something like:

$ yum install fail2ban

will do the trick.
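
If yum can’t find the package, chances are the EPEL repo itself isn’t set up yet. On a CentOS 6-era system the full sequence would look something like this (a sketch, assuming the SysV-style service commands used elsewhere in this post):

$ yum install epel-release
$ yum install fail2ban
$ chkconfig fail2ban on
$ service fail2ban start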

I like fail2ban because it has the feel of a modern package. It’s written in python, for instance, and it is still maintained by its author. There are zillions of options, which make it daunting at first.

To stop these ssh attacks in their tracks all you need is to create a jail.local file in /etc/fail2ban. Mine looks like this:

# DrJ - enable sshd monitoring
[DEFAULT]
bantime = 3600
# exempt CenturyLink
ignoreip = 76.6.0.0/16  71.48.0.0/16
#
[sshd]
enabled = true
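
Two other settings people commonly tune are maxretry and findtime. The values below are only illustrative – roughly the defaults – not a recommendation:

[DEFAULT]
bantime  = 3600
# ban after 5 failed attempts seen within a 10-minute (600 s) window
maxretry = 5
findtime = 600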

Then reload it:

$ service fail2ban reload

and check it:

$ service fail2ban status

fail2ban-server (pid  28459) is running...
Status
|- Number of jail:      1
`- Jail list:   sshd
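
fail2ban-client gives a more detailed view of an individual jail, including the list of currently banned IPs:

$ fail2ban-client status sshd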

And most sweetly of all, wait a day or two and appreciate the marked change in the contents of btmp or secure:

support  ssh:notty    117.4.240.22     Mon Nov  2 07:05    gone - no logout
support  ssh:notty    117.4.240.22     Mon Nov  2 07:05 - 07:05  (00:00)
dff      ssh:notty    62.232.207.210   Mon Nov  2 03:38 - 07:05  (03:26)
dff      ssh:notty    62.232.207.210   Mon Nov  2 03:38 - 03:38  (00:00)
zhangyan ssh:notty    62.232.207.210   Mon Nov  2 03:38 - 03:38  (00:00)
zhangyan ssh:notty    62.232.207.210   Mon Nov  2 03:38 - 03:38  (00:00)
support  ssh:notty    117.4.240.22     Sun Nov  1 22:47 - 03:38  (04:50)
support  ssh:notty    117.4.240.22     Sun Nov  1 22:47 - 22:47  (00:00)
oracle   ssh:notty    180.210.201.106  Sun Nov  1 20:44 - 22:47  (02:03)
oracle   ssh:notty    180.210.201.106  Sun Nov  1 20:44 - 20:44  (00:00)
a        ssh:notty    180.210.201.106  Sun Nov  1 20:44 - 20:44  (00:00)
a        ssh:notty    180.210.201.106  Sun Nov  1 20:44 - 20:44  (00:00)
openerp  ssh:notty    123.212.42.241   Sun Nov  1 20:40 - 20:44  (00:04)
openerp  ssh:notty    123.212.42.241   Sun Nov  1 20:40 - 20:40  (00:00)
dff      ssh:notty    187.210.58.215   Sun Nov  1 20:36 - 20:40  (00:04)
dff      ssh:notty    187.210.58.215   Sun Nov  1 20:36 - 20:36  (00:00)
zhangyan ssh:notty    187.210.58.215   Sun Nov  1 20:36 - 20:36  (00:00)
zhangyan ssh:notty    187.210.58.215   Sun Nov  1 20:35 - 20:36  (00:00)
root     ssh:notty    82.138.1.118     Sun Nov  1 19:57 - 20:35  (00:38)
root     ssh:notty    82.138.1.118     Sun Nov  1 19:49 - 19:57  (00:08)
root     ssh:notty    82.138.1.118     Sun Nov  1 19:49 - 19:49  (00:00)
root     ssh:notty    82.138.1.118     Sun Nov  1 19:49 - 19:49  (00:00)
PlcmSpIp ssh:notty    82.138.1.118     Sun Nov  1 18:42 - 19:49  (01:06)
PlcmSpIp ssh:notty    82.138.1.118     Sun Nov  1 18:42 - 18:42  (00:00)
oracle   ssh:notty    82.138.1.118     Sun Nov  1 18:34 - 18:42  (00:08)
oracle   ssh:notty    82.138.1.118     Sun Nov  1 18:34 - 18:34  (00:00)
karaf    ssh:notty    82.138.1.118     Sun Nov  1 18:18 - 18:34  (00:16)
karaf    ssh:notty    82.138.1.118     Sun Nov  1 18:18 - 18:18  (00:00)
vagrant  ssh:notty    82.138.1.118     Sun Nov  1 17:13 - 18:18  (01:04)
vagrant  ssh:notty    82.138.1.118     Sun Nov  1 17:13 - 17:13  (00:00)
ubnt     ssh:notty    82.138.1.118     Sun Nov  1 17:05 - 17:13  (00:08)
ubnt     ssh:notty    82.138.1.118     Sun Nov  1 17:05 - 17:05  (00:00)
...

The attacks still come, yes, but they are so quickly snuffed out that there is almost no chance of correctly guessing a password – unless the attacker has a couple centuries on their hands!

Augment fail2ban with a network nail

Now in my case I had noticed attacks coming from various IPs around 43.229.53.13, and I’m still kind of disturbed by that, even after fail2ban was implemented. Who is that? Arin.net said that range is handled by apnic, the Asia Pacific NIC. apnic’s whois (apnic.net) says it is a building in the Mong Kok district of Hong Kong. Now I’ve been to Hong Kong and the Mong Kok district. It’s very expensive real estate and I think the people who own that subnet have better ways to earn money than trying to pwn AWS servers. So I think probably mainland hackers have a backdoor to this Hong Kong network and are using it as their playground. Just a wild guess. So anyhow I augmented fail2ban with a network route to prevent all such attacks from that network:

$ route add -net 43.229.0.0/16 gw 127.0.0.1
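
One caveat: a route added at the command line like this disappears at the next reboot. A crude but effective way to make it persistent (just a sketch – CentOS also has per-interface static-route files if you prefer) is to append the same command to /etc/rc.local:

# in /etc/rc.local – re-create the block at boot
route add -net 43.229.0.0/16 gw 127.0.0.1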

A few words on fail2ban

How does fail2ban actually work? It manipulates the local firewall, iptables, as needed. So it will activate iptables if you aren’t already running it. Right now my iptables looks clean so I guess fail2ban hasn’t found anything recently to object to:

$ iptables -L

Chain INPUT (policy ACCEPT)
target     prot opt source               destination
f2b-sshd   tcp  --  anywhere             anywhere            multiport dports ssh
 
Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
 
Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
 
Chain f2b-sshd (1 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere
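
For comparison, when an address is actively banned you would see a REJECT rule for it at the top of the f2b-sshd chain, something along these lines (the IP shown is purely illustrative):

Chain f2b-sshd (1 references)
target     prot opt source               destination
REJECT     all  --  62.232.207.210       anywhere            reject-with icmp-port-unreachable
RETURN     all  --  anywhere             anywhere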

Indeed, checking my messages file the most recent ban was over an hour ago – in the early morning:

Nov  2 03:38:49 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 62.232.207.210

And here is fail2ban doing its job since the log files were rotated at the beginning of the month:

$ cd /var/log; grep Ban messages

Nov  1 04:56:19 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 185.61.136.43
Nov  1 05:49:21 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 5.8.66.78
Nov  1 11:27:53 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 61.147.103.184
Nov  1 11:32:51 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 118.69.135.24
Nov  1 16:57:05 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 162.246.16.55
Nov  1 17:13:17 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 82.138.1.118
Nov  1 18:42:36 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 82.138.1.118
Nov  1 19:57:55 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 82.138.1.118
Nov  1 20:36:05 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 187.210.58.215
Nov  1 20:44:17 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 180.210.201.106
Nov  2 03:38:49 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 62.232.207.210
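
A quick way to see which sources have been banned most often is to tally those Ban lines (a sketch; the IP is the last field on each line):

$ grep Ban /var/log/messages | awk '{print $NF}' | sort | uniq -c | sort -rn | head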

Almost forgot to mention
How did I free up space so I could still examine btmp? I deleted an older large log file, secure-20151011, which was about 400 MB. No reboot necessary of course. Mysql restarted successfully, as did the web servers, and I was back in business logging in to my WP site.

August 2017 update
I finally had to reboot my AWS instance after more than three years. I thought about my ssh usage pattern and decided it was really predictable: I either ssh from home or work, both of which have known IPs. And I’m simply tired of seeing all the hack attacks against my server. And I got better with the AWS console out of necessity.
Put it all together and you get a better way to deal with the ssh logins: simply block ssh (tcp port 22) with an AWS security group rule, except from my home and work.
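
For reference, adding one of those allow rules via the aws CLI would look roughly like this – the security group ID and source CIDR below are placeholders, not my real values:

$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 22 --cidr 203.0.113.4/32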

Conclusion
The mystery of the failed WordPress login is examined in great detail here. The case was cracked wide open and the trails that were followed led to discovery of a brute force attempt to gain root access to the hosting server. Methods were implemented to ward off these attacks. An older log file was deleted from /var/log and mysql restarted. WordPress logins are working once again.

References and related info
fail2ban is documented in a wiki of uneven quality at www.fail2ban.org.
Another tool is DenyHosts. One of the ideas behind DenyHosts – its capability to share data – sounds great, but look at this page: http://stats.denyhosts.net/stats.html. “Today’s data” is date-stamped July 11, 2011 – four years ago! It looks like development was suddenly abandoned back then, which is never a good sign for a security tool.

Categories
Admin

SD Card reader not working after Windows 10 upgrade

Intro
I was more than a little alarmed when the built-in SD card reader on my Dell Inspiron stopped working properly after I upgraded from Windows 7 to Windows 10. After the upgrade I inserted an SD card into the reader and nothing happened in File Explorer! This led to some tense moments.

The details
Here’s File Explorer after inserting the SD card:

File Explorer

The DVD drive is nowhere to be found, and the same goes for the SD card.

But if I right-click on This PC and select manage it looks like this:

Disk Management

So that would make it seem that Disk 1 is removable media mapped to the E: drive and my DVD player is mapped to the D: drive. Interesting. So let’s try this in File Explorer (known as Windows Explorer in previous versions of Windows). Type

E:

in the field where it says Quick access. Sure enough it magically appears:

Now with E: Drive mapped

And I can do the normal File Explorer operations with it.

I think there is a more permanent fix but for me I have no problem typing e: the few times I need to read an SD card.

Oh, and the DVD drive? It was there all along. I see it when I highlight This PC:

Now with DVD drive

Conclusion
If you don’t see your SD card when running Windows 10 don’t panic. It may be there alright. Type E: in the Quick Access field. Or maybe D: or F: – depends on your PC’s configuration, which I’ve shown how to list above. I believe a more permanent fix involves re-installing or repairing a driver, but I haven’t had time to look into it. My approach will get you working quickly in a pinch, like, say, when you have to get the photos off your camera’s SD card because you need them right now.

References and related articles
This Microsoft Technet discussion was helpful to me. It was slow to load however.

Categories
Admin Apache Hosting Service IT Operational Excellence Linux Web Site Technologies

Scaling your apache to handle more requests

Intro
I was running an apache instance very happily with mostly default options until the day came that I noticed it was taking seconds to serve a simple web page – one that it used to serve in 50 ms or so. I eventually rolled up my sleeves to see what could be done about it. It seems that what had changed is that it was being asked to handle more requests than ever before.

The details
But the load average on a 16-core server was only at 2! sar showed no particular problems with either the cpu or the I/O subsystem. Both showed plenty of spare capacity. A process count showed about 258 apache processes running.

An Internet search helped me pinpoint the problem. Now bear in mind I use a version of apache I myself compiled, so the file layout looks different from the system-supplied apache, but the ideas are the same. What you need is to increase the number of allowed processes. On my server with its great capacity I scaled up considerably. These settings are in /conf/extra/httpd-mpm.conf in the compiled version. In the system-supplied version on SLES I found the equivalent to be /etc/apache2/server-tuning.conf. To begin with the key section of that file had these values:

<IfModule mpm_prefork_module>
    StartServers             5
    MinSpareServers          5
    MaxSpareServers         10
    MaxRequestWorkers      250
    MaxConnectionsPerChild   0
</IfModule>

(The correct section is <IfModule prefork.c> in the system-supplied apache).

I replaced these as follows:

<IfModule mpm_prefork_module>
    StartServers          256
    MinSpareServers        16
    MaxSpareServers       128
    ServerLimit          2048
    MaxClients           2048
    MaxRequestsPerChild  20000
</IfModule>

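One sanity check before picking numbers this large: every prefork child consumes real memory, so it pays to measure the average resident size of your httpd children and divide the RAM you can spare by that figure. This little pipeline is only a sketch – the process name may differ on your build:

$ ps -C httpd -o rss= | awk '{sum+=$1; n++} END {printf "avg child ~ %.0f MB across %d children\n", sum/n/1024, n}'
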
Note that ServerLimit has to be greater than or equal to MaxClients (thank you Apache developers!) or you get an error like this when you start apache:

WARNING: MaxClients of 2048 exceeds ServerLimit value of 256 servers,
 lowering MaxClients to 256.  To increase, please see the ServerLimit
 directive.

So you make this change, right, stop/start apache and what difference do you see? Probably none whatsoever! Because you probably forgot to uncomment this line in httpd.conf:

#Include conf/extra/httpd-mpm.conf

So remove the # at the beginning of that line and stop/start. If, like me, you’ve changed the usual directory where the PID file and lock file get written in your httpd.conf file, you may need this additional measure, which I had to do in the httpd-mpm.conf file:

<IfModule !mpm_netware_module>
    #PidFile "logs/httpd.pid"
</IfModule>
 
#
# The accept serialization lock file MUST BE STORED ON A LOCAL DISK.
#
<IfModule !mpm_winnt_module>
<IfModule !mpm_netware_module>
#LockFile "logs/accept.lock"
</IfModule>
</IfModule>

In other words I commented out this file’s attempt to place the PID and lock files in a certain place because I have my own way of storing those and it was overwriting my choices!

But with all those changes put together it works much, much better than before and can handle more requests than ever.

Analysis
In creating a simple benchmark we could easily scale to 400 requests / second, and we didn’t really even try to push it – and this was before we changed any parameters. So why couldn’t 250 or so simultaneous processes handle more real world requests? I believe that if all clients were as fast as our server it could have handled them all. But the clients themselves were sometimes distant (thousands of miles away) with slow or lossy connections. They need to acknowledge every packet sent by the web server, and the web server has to wait around for that, unable to go on to the next client request! Real life is not like laboratory testing. Since all that waiting around requires next to no cpu, the load average didn’t rise even though we had run up against a limit; the limit was an artificial, application-imposed one, not a system-imposed resource constraint.

More analysis, what about threads?

Is this the only or best way to scale up your web server? Probably not. It’s probably the most practical, however, because odds are you didn’t compile apache with support for threads. I know I didn’t. Or if you’re using the system-provided package, it probably doesn’t support threads either. Find your httpd binary and run this command:

$ ./httpd -l|grep prefork

If it returns:

  prefork.c

you have the prefork module and not the worker module, and the above approach is what you need to do. To me a more modern approach is to scale by using threads – modern cpus are designed to run threads, which are kind of like light-weight processes. But, oh well. The gatekeepers of apache packages seem stuck in this simple-minded one-process-per-request mindset.
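
On apache 2.4 you can also ask the binary directly which MPM it will use, which is a handy cross-check if the grep above comes up empty:

$ ./httpd -V | grep -i 'server mpm'

and it should print something like Server MPM: prefork.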

Conclusion
My scaled-up apache is handling more requests than ever. I’ve documented how I increased the total process count.

References and related articles
How I compiled apache 2.4 and ran into (and resolved) a zillion errors seems to be a popular post!
The mystery of why we receive hundreds or even thousands of PAC file requests from each client every day remains unsolved to this day. That’s why we needed to scale up this apache instance – it is serving the PAC file. I first wrote about it three and a half years ago!