Categories
Admin Linux

Narrowing down answer to NPR puzzle with Linux commands

Intro
This is for CentOS and RedHat Linux.

Narrow things down
$ egrep '^[a-z]{6}$' /usr/share/dict/linux.words | sed 's/.//' | sort | uniq -c | sort -k1 -d -r > 6-ltr-last-5

This is a great string of commands to study if you want to unleash the power of the Linux shell. You have a matching operator, egrep, with a simple regular expression; a substitution command, sed; a sort command, sort; a duplicate-counting command, uniq; and a final sort ordered by count and displayed in reverse (descending) order. I issue commands like this frequently against log files and can do much more important work than solving an NPR puzzle.
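
To see what each stage does, here is the same pipeline run against a tiny inline word list instead of /usr/share/dict/linux.words (a minimal, self-contained illustration; the word list is made up):

```shell
# Seven six-letter words; drop the first letter of each, count the
# five-letter endings, and list the most common endings first.
printf '%s\n' bitter fitter hitter dingle jingle single mingle |
  sed 's/.//' |       # drop the first character of every line
  sort | uniq -c |    # uniq -c needs sorted input; it prefixes each ending with its count
  sort -k1 -d -r      # highest count first
```

This prints the ending ingle (count 4) first, then itter (count 3), mirroring the shape of 6-ltr-last-5.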

6-ltr-last-5 starts like this:

     14 itter
     14 ingle
     14 atter
     14 agged
     13 etter
     13 ester
     13 aster
     13 apper
     13 apped
     13 agger
...

It has 772 lines with 4 or greater occurrences – too many to process by hand.

Edit this file and keep only the top part, up to and including the last line with four occurrences.
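
If you'd rather not edit by hand, awk can do the same trimming (my suggestion, not part of the original workflow; field 1 is the count, and the .top output name is made up):

```shell
# Keep only the lines whose leading count is at least 4.
awk '$1 >= 4' 6-ltr-last-5 > 6-ltr-last-5.top
```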

Now go back and match these words against the dictionary.

$ cat 6-ltr-last-5 | awk '{print $2}' | while read line; do egrep "^[a-z]$line\$" /usr/share/dict/linux.words >> 6-ltr-combos; echo " " >> 6-ltr-combos; done

6-ltr-combos starts like this:

bitter
fitter
gitter
hitter
jitter
kitter
litter
nitter
pitter
ritter
sitter
titter
witter
zitter
 
bingle
cingle
dingle
gingle
hingle
jingle
...

Small program to process that file
OK, to work with the file we just created, and following the logic of the problem statement, I wrote this custom Perl script, which I call 6-5.pl:

#!/usr/bin/perl
$DEBUG = 0;
$consonants = 'bcdfghjlmnpqrstvwxyz';
$oldplace = -1;
$pot = 0;
while(<STDIN>){
  if (/^\s/) {
    print "pot,start word = $pot, $startword\n" if $pot > 3;
# reset some  values
    $oldplace = -1;
    $pot = 0;
    $startword = $_;
  }
  chomp;
# get at first character
  ($char) = /^(\w)/;
# turn character into position number with this
  $place = index $consonants,$char;
  print "word,place: $_,$place\n" if $DEBUG;
  if ($place != $oldplace + 1) {
# clear things out
    print "pot,start word = $pot, $startword\n" if $pot > 3;
    $pot  = 1;
    $startword = $_;
  } else {
    $pot++;
  }
  print "pot: $pot\n" if $DEBUG;
  $oldplace = $place;
}

I really wish I knew Python – I bet it would be an even shorter script in that language. But this gets the job done. It's warts and all: I have done just enough debugging to get it to return mostly reasonable output, and it's still not quite right. It's good enough…
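
For what it's worth, here is a sketch of how the same run-counting logic might look in Python. This is my translation, not the script used above, and note one difference: I include 'k' in the consonant list, which the Perl version's string happens to omit.

```python
# Consonant list; note the Perl version's string omits 'k'.
CONSONANTS = "bcdfghjklmnpqrstvwxyz"

def find_runs(lines, min_pot=4):
    """Report (length, first word) for runs of words whose first
    letters are consecutive in CONSONANTS; groups end at blank lines."""
    results = []
    oldplace, pot, startword = -2, 0, ""
    for line in lines:
        word = line.strip()
        if not word:                      # blank separator between groups
            if pot >= min_pot:
                results.append((pot, startword))
            oldplace, pot = -2, 0
            continue
        place = CONSONANTS.find(word[0])
        if place == oldplace + 1:
            pot += 1                      # run continues
        else:
            if pot >= min_pot:            # run ended; report if long enough
                results.append((pot, startword))
            pot, startword = 1, word      # start a new run
        oldplace = place
    if pot >= min_pot:                    # flush the final run
        results.append((pot, startword))
    return results
```

Feed it the lines of the combos file, e.g. find_runs(open("6-ltr-combos")).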

Run it:

$ ./6-5.pl < 6-ltr-combos

pot,start word = 4, fitter
pot,start word = 7, gingle
pot,start word = 4, latter
pot,start word = 8, dagged
pot,start word = 5, fetter
pot,start word = 5, jester
pot,start word = 6, dagger
...

The biggest problem is that my dictionary contains too many uncommon words, but at least that guarantees the answer will indeed be present. And it is. In fact I found three sets of what I consider common words. One set consists of very ordinary words, so I guess that is the intended answer. I can’t give away everything right now – you’ll have to do some work! I’ll post the answers after Sunday.

References and related
A similar approach to a previous puzzle is here.

Categories
Network Technologies Python

Tips on using scapy for custom IP packets

Intro
scapy is an IP packet customization tool that keeps coming up in my searches, so I could no longer avoid it. I was unnecessarily intimidated because it is built around Python and the documentation is a little strange. But I’m warming up to it now…

The details
Download and install
CentOS
Just go to scapy.net and it will offer you a .zip file to download. I got scapy-2.3.1.zip. Unzip it, change directory to the scapy-2.3.1 sub-directory and run

$ sudo python setup.py install

Debian systems such as Raspberry Pi
Simple. It’s just:

$ sudo apt-get install python-scapy

Usage modes
scapy can be called from within python, but if you’re afraid to do that like I am, you can run it from the command line which simply throws you into a python shell. I’m finding that a lot more comfortable as I slowly learn python syntax and some useful shortcuts.

Example 1
The background
Let’s cut to the chase and do something hard first. Remember how we got those Cisco Jabber packets with DSCP set, causing Cisco Jabber to not work for some users? The long-term solution according to that post is to turn off the DSCP flag for all packets on the Internet router. So we want to be able to generate packets under our control with that flag set so we can see if we’ve managed to turn it off correctly.

The DSCP value occupies the first six bits of the eight-bit tos field. The packets we got from Cisco had a DSCP of 0x2e, which is Expedited Forwarding (EF), and if you do the math that corresponds to a tos of 0xb8, which in decimal is 184.
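
The arithmetic, for the record: since DSCP occupies the top six bits of the tos byte (the bottom two bits are ECN), converting DSCP to tos is a two-bit left shift:

```python
# DSCP sits in the upper 6 bits of the tos byte; the low 2 bits are ECN.
dscp_ef = 0x2e            # Expedited Forwarding
tos = dscp_ef << 2        # shift past the two ECN bits
print(hex(tos), tos)      # 0xb8 184
```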

$ sudo scapy
>>> sr(IP(dst="50.17.188.196",tos=184)/TCP(dport=80,sport=4025))

Begin emission:
....Finished to send 1 packets.
.*
Received 6 packets, got 1 answers, remaining 0 packets
(<Results: TCP:1 UDP:0 ICMP:0 Other:0>, <Unanswered: TCP:0 UDP:0 ICMP:0 Other:0>)
>>>

Instead of the call to sr you can simply use send. Breaking this down: I’m testing against my drjohns server at IP 50.17.188.196. tos is a property of an IP packet, so it’s included as a keyword argument to the IP function. The “/” following the IP function looks like funny syntax, but it is scapy’s layer-stacking operator: it says that properties for the next protocol layer are coming. So in the TCP section I used keyword arguments to set a source port of 4025 and a destination port of 80. What I observed is that this sends a SYN packet even though I didn’t explicitly request that.

Want to have a random source port like “real” packets? Then use this:

>>> sr(IP(dst="50.17.188.196",tos=184)/TCP(dport=80,sport=RandShort()))

Look for it
I know tcpdump better so I look for my packet with that tool like this:

$ sudo tcpdump -v -n -i eth0 host 71.2.39.115 and port 80

tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
19:33:29.749170 IP (tos 0xb8, ttl 39, id 1, offset 0, flags [none], proto TCP (6), length 44)
    71.2.39.115.partimage > 10.185.21.116.http: Flags [S], cksum 0xd97b (correct), seq 0, win 8192, options [mss 1460], length 0
19:33:29.749217 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 44)
    10.185.21.116.http > 71.2.39.115.partimage: Flags [S.], cksum 0x3e39 (correct), seq 3026513916, ack 1, win 5840, options [mss 1460], length 0
19:33:29.781193 IP (tos 0x0, ttl 41, id 19578, offset 0, flags [DF], proto TCP (6), length 40)

Interpretation
Our tos was wiped clean by the time our generated packet was received by Amazon AWS. This was a packet I sent from my home using my Raspberry Pi. So likely my ISP CenturyLink is removing QOS from packets its residential customers send out. With some ISPs and business class service I have seen the tos field preserved exactly. When sent from Amazon AWS I saw the field value altered, but not set to 0!

Example 2, ping
>>> sr(IP(dst="8.8.8.8")/ICMP())

Begin emission:
Finished to send 1 packets.
.*
Received 2 packets, got 1 answers, remaining 0 packets
(<Results: TCP:0 UDP:0 ICMP:1 Other:0>, <Unanswered: TCP:0 UDP:0 ICMP:0 Other:0>)

Getting info on return packet
>>> sr1(IP(dst="drjohnstechtalk.com",tos=184)/TCP(dport=80,sport=RandShort()))

Begin emission:
...............................................................................................Finished to send 1 packets.
...........................................*
Received 139 packets, got 1 answers, remaining 0 packets
<IP  version=4L ihl=5L tos=0x0 len=44 id=0 flags=DF frag=0L ttl=25 proto=tcp chksum=0xe1d7 
src=50.17.188.196 dst=144.29.1.2 options=[] |<TCP  sport=http dport=17176 seq=3590570804 ack=1 dataofs=6L reserved=0L flags=SA window=5840 chksum=0x24b0 urgptr=0 options=[('MSS', 1460)] |<Padding  load='\x00\x00' |>>>

Note that this tells me about the return packet, which is a SYN ACK. So it tells me my SYN packet must have been sent from port 17176 (it changes every time because I’ve included sport=RandShort()). Each “.” in the response indicates a packet hitting the interface. I guess it’s promiscuously listening on the interface.

Hitting a closed port

>>> sr1(IP(dst="drjohnstechtalk.com",tos=184)/TCP(dport=81,sport=RandShort()))

Begin emission:
....................................................................................
.........................................................................................................Finished to send 1 packets.
...............................................................................
.................................................................................
.........................................................................................

Basically those dots will keep coming forever until you type Ctrl-C, because there will be no return packet if something like a firewall is dropping your packet, or the returned packet.

Useful shortcuts
The scapy commands look pretty daunting at first, right? And too much trouble to type, right? Just get one right once and you’re set. In typical network debugging you’ll be sending such test packets multiple times. Because it’s basically a Python shell, you can use the up-arrow key to recall the previous command, or hit it multiple times to scroll through your previously typed commands. And even if you exit and return, it still remembers your command history, so you can hit the up-arrow to get back to your commands from previous sessions and previous days.

References and related
This scapy for dummies guide is very well written.
I’m finding this python tutorial really helpful.
DSCP and explanation of Cisco Jabber not working is described here.
A simpler tool which is fine for most things is nmap. I provide some real-world examples in this blog post.

Categories
Admin Web Site Technologies

The IT Detective agency: Outlook client is Disconnected, all else fine

Intro
Today we were asked to consult on the following problem. Some proxy users at a large company could not connect to Microsoft Outlook. Only a few users were affected. Fix it.

The details
Affected users would bring up Outlook and within a few short seconds it would simply show Disconnected and stay that way.

It was quickly established that the affected users shared this in common: they use LDAP authentication with proxy basic authentication. The users who worked used NTLM authentication. The two groups are distinguished by being handed different proxy autoconfiguration (PAC) files.

More observations
Well, actually there was almost no difference whatsoever between the two PAC files. They are syntactically identical. The only difference in fact is that a different proxy is handed out for the NTLM users. That’s it!

We were able to reproduce the problem ourselves by using the same PAC file as the affected user. We tried to trace the traffic on our desktop but it was a complete mess. I did not see any connection to the designated proxy for Outlook traffic, but it’s hard to say definitively because there is so much other junk present. Strangely, all web sites worked OK and even the web-based version of Outlook works OK. So this Outlook client was the only known application having a problem.

When the affected users put in the proxy directly in manual proxy settings in IE and turned off proxy autoconfig, then Outlook worked. Strange.

We observed that the headers for the PAC file were a little inconsistent (it was being served from multiple web servers through a load balancer). The Content-Type MIME header was coming back as either text/plain or was absent altogether, depending on which web server you hit. But note that the NTLM users were also getting PAC files with these same headers.

The solution

Although everything had been fine with this header situation up until the introduction of Outlook, we guessed it was technically incorrect and should be fixed. We changed all web servers to have the PAC file be served with this MIME header:

Content-Type: application/x-ns-proxy-autoconfig
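
If, for example, the PAC files are served by Apache (an assumption – adapt this to whatever your web servers actually run), one way to get that header is a mod_mime AddType directive mapping the .pac extension:

```
AddType application/x-ns-proxy-autoconfig .pac
```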

The results

A re-test confirmed that this fixed the Outlook problem for the LDAP-affected users. NTLM users were not impacted and continued to work fine.

Conclusion
A strange Outlook connection problem in a large company intranet was resolved by serving the PAC file with the correct Content-Type header. Case closed!

References and related information
Here’s a PAC file case we never did resolve: excessive calls to the PAC file web server from individual users.

Categories
Admin

SiteScope keeps restarting

Intro
I’m just documenting what the support tech had me do to fix this scary issue.

The details
This was a SiteScope v 11.24 instance running on a RHEL 6.6 VM.

2015-12-08 05:12:56,768 [SiteScope Main Thread] (SiteScopeSupport.java:721) ERROR - SiteScope unexpected shutdown
java.lang.reflect.UndeclaredThrowableException
        at com.sun.proxy.$Proxy106.postInit(Unknown Source)
        at com.mercury.sitescope.platform.configmanager.ConfigManager$PersistencyAdaptor.initPersistObjectsAfterLoad(ConfigManager.java:1967)
        at com.mercury.sitescope.platform.configmanager.ConfigManager$PersistencyAdaptor.initialize(ConfigManager.java:1247)
        at com.mercury.sitescope.platform.configmanager.ConfigManager$PersistencyAdaptor.access$300(ConfigManager.java:1112)
        at com.mercury.sitescope.platform.configmanager.ConfigManager.initialize(ConfigManager.java:145)
        at com.mercury.sitescope.platform.configmanager.ConfigManagerSession.initialize(ConfigManagerSession.java:153)
        at com.mercury.sitescope.bootstrap.SiteScopeSupport.initializeSiteScope(SiteScopeSupport.java:592)
        at com.mercury.sitescope.bootstrap.SiteScopeSupport.configureSiteScope(SiteScopeSupport.java:629)
        at com.mercury.sitescope.bootstrap.SiteScopeSupport.siteScopeMain(SiteScopeSupport.java:678)
        at com.mercury.sitescope.web.servlet.InitSiteScope$SiteScopeMainThread.run(InitSiteScope.java:233)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedMethodAccessor109.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.mercury.sitescope.platform.configmanager.ManagedObjectConfigRef$ManagedObjectProxyHandler.invoke(ManagedObjectConfigRef.java:290)
        ... 10 more
Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
        at com.mercury.sitescope.entities.monitors.MonitorGroup.readDynamic(MonitorGroup.java:445)
        at com.mercury.sitescope.entities.monitors.MonitorGroup.postInit(MonitorGroup.java:2001)
        ... 14 more
2015-12-08 05:12:56,776 [SiteScope Main Thread] (SiteScopeShutdown.java:51) INFO  - Shutting down SiteScope reason Exception: java.lang.reflect.UndeclaredThrowableException null...
2015-12-08 05:12:56,784 [ShutdownThread at Tue Dec 08 05:12:56 EST 2015] (SiteScopeGroup.java:1061) INFO  - Stopping dynamic counters flow...
2015-12-08 05:12:56,832 [ShutdownThread at Tue Dec 08 05:12:56 EST 2015] (SiteScopeGroup.java:1112) INFO  - Waiting 40 secs to allow monitors to complete.
2015-12-08 05:13:36,854 [ShutdownThread at Tue Dec 08 05:12:56 EST 2015] (SiteScopeGroup.java:1280) INFO  - Average Monitors Running: 0
2015-12-08 05:13:36,855 [ShutdownThread at Tue Dec 08 05:12:56 EST 2015] (SiteScopeGroup.java:1281) INFO  - Peak Monitors Per Minute: 0 at 7:00 pm 12/31/69
2015-12-08 05:13:36,855 [ShutdownThread at Tue Dec 08 05:12:56 EST 2015] (SiteScopeGroup.java:1283) INFO  - Peak Monitors Running: 0.0 at 7:00 pm 12/31/69
2015-12-08 05:13:36,855 [ShutdownThread at Tue Dec 08 05:12:56 EST 2015] (SiteScopeGroup.java:1284) INFO  - Peak Monitors Waiting: 0 at 7:00 pm 12/31/69

And this just kept happening and happening.

The solution
The support tech from HPE had me go into the groups directory and delete all files except those ending in .dyn and .config. Those directories are under /opt/HP/SiteScope on my installation.

In the persistency directory we deleted all files ending in .tmp. But we made saved copies of the entire original groups and persistency directories elsewhere just in case.

The results
HP SiteScope started just fine after that! A healthy SiteScope startup includes lines like these:

2015-12-08 08:59:12,149 [SiteScope Main] (SiteScopeGroup.java:995) INFO  - Open your web browser to:
2015-12-08 08:59:12,149 [SiteScope Main] (SiteScopeGroup.java:996) INFO  -   http://10.192.136.89:8080
2015-12-08 08:59:12,180 [SiteScope Main] (SiteScopeGroup.java:324) INFO  - Starting common scheduler...
2015-12-08 08:59:12,273 [SiteScope Main] (SiteScopeGroup.java:344) INFO  - Starting maintenance scheduler...
2015-12-08 08:59:12,381 [SiteScope Main] (SiteScopeGroup.java:609) INFO  - Starting topaz manager
2015-12-08 08:59:12,384 [SiteScope Main] (SiteScopeGroup.java:611) INFO  - Topaz manager started.
2015-12-08 08:59:12,386 [SiteScope Main] (SiteScopeGroup.java:457) INFO  - Starting monitor scheduler...
2015-12-08 08:59:12,386 [SiteScope Main] (SiteScopeGroup.java:462) INFO  - Starting report scheduler...
2015-12-08 08:59:12,398 [SiteScope Main] (SiteScopeGroup.java:477) INFO  - Starting analytics scheduler...
2015-12-08 08:59:12,398 [SiteScope Main] (SiteScopeGroup.java:488) INFO  -
2015-12-08 08:59:12,398 [SiteScope Main] (SiteScopeGroup.java:489) INFO  - SiteScope is active...
2015-12-08 08:59:12,493 [SiteScope Main] (SiteScopeGroup.java:513) INFO  - SiteScope Starting all monitors
2015-12-08 08:59:12,830 [SiteScope Main] (SiteScopeGroup.java:517) INFO  - SiteScope Start monitors completed
2015-12-08 08:59:13,037 [SiteScope Main] (SiteScopeGroup.java:532) INFO  - SiteScope Startup Completed
2015-12-08 08:59:13,215 [SiteScope Main] (SiteScopeSupport.java:713) INFO  - SiteScope 11.24.241  build 165 process started at Tue Dec 08 08:59:13 EST 2015
2015-12-08 08:59:13,215 [SiteScope Main] (SiteScopeSupport.java:714) INFO  - SiteScope Start took 26 sec

Especially that last line.

The hypothesis
As the server had crashed, the hypothesis is that one of these files had become corrupted.

Conclusion
HP SiteScope, which could not start, was fixed by removing some older files. These files are not needed, either – your configuration will not be lost when you delete them.

Categories
Linux Perl

Solution to NPR puzzle: capitals and cities

Problem statement
Take a US state capital. Drop a letter. Re-arrange the remaining letters to get the name of a US city. There are two answers.

I know Will Shortz says he makes up problems that are not programmable, but he goofed up this time, big time. This one is eminently programmable. I am not even a programmer, more a scripter. I don’t want to have an unfair advantage, so I’m sharing my Perl scripts that solve this problem.

The setup
Find a Linux machine. A Raspberry Pi makes a great economical choice if you don’t have anything like that.

Scrape the 50 state capitals from a web page that lists them (I used Wikipedia). By scrape I mean copy-and-paste into a file in a terminal window of your Linux server.

Same for US cities. I went to a more obscure listing and picked up 666 of the largest cities from a listing of the top 1000.

Cleaning up the Capitals
Here’s a script for processing those capital cities. The input file contains lines like these first two:

Alabama         AL      1819    Montgomery      1846    155.4   2       205,764         374,536         Birmingham is the state's largest city
Alaska  AK      1959    Juneau  1906    2716.7  3       31,275          Juneau is the largest capital by land area. Anchorage is the state's largest city.
...

I called it cap-fltr.pl

#!/usr/bin/perl
while(<STDIN>) {
  ($cap) = /\d{4}\s+(\D+)\s+\d{4}/;
  $cap = lc $cap;
  $cap =~ s/\s//g;
  print "$cap\n";
}

Create a clean listing of capitals like this:

$ ./cap-fltr.pl < capitals.orig > capitals

assuming, that is, that you dumped your paste buffer of scraped capitals and related information into a file called capitals.orig. So that creates a file called capitals looking like this:

montgomery
juneau
phoenix
littlerock
sacramento
denver
...

Cleaning our cities
My scraped cities file morecities.orig looks like this:

1       New York - city         New York        8,491,079       5.9%
2       Los Angeles - city      California      3,928,864       6.0%
3       Chicago - city  Illinois        2,722,389       -5.9%
4       Houston - city  Texas   2,239,558       13.2%
5       Philadelphia - city     Pennsylvania    1,560,297       3.0%
6       Phoenix - city  Arizona         1,537,058       15.8%
...

These can be cleaned up with this script, which I call morecities-fltr.pl:

#!/usr/bin/perl
while(<STDIN>) {
  ($city) = /^\d+\s+([^-]+)\s-/;
  $city = lc $city;
  $city =~ s/\s//g;
  if ($city =~ /st\./) {
    $form1 = $city;
    $form1 =~ s/\.//g;
    print "$form1\n";
  }
  $city =~ s/st\./saint/;
  print "$city\n";
}

Note that in this filter, as in the capitals filter, I remove spaces and lowercase everything; here I also treat st and saint as alternate forms, just in case. Those puzzle-masters can be tricky!

So the cleanup goes into a cities file with a command like this:

$ ./morecities-fltr.pl < morecities.orig > cities

Finally the program that solves the puzzle…
The solution script, which I called sol2.pl, isn’t too bad if you know a tiny bit of Perl. It is this:

#!/usr/bin/perl
$DEBUG = 0;
open(CAPITALS,"capitals");
open(CITIES,"cities");
@cities = <CITIES>;
@capitals = <CAPITALS>;
foreach $capital (@capitals) {
  chomp($capital);
  print "capital: $capital\n" if $DEBUG;
  $lenCap = length $capital;
  foreach $city (@cities) {
    chomp($city);
    print "city: $city\n" if $DEBUG;
    $lenCity = length $city;
    next unless $lenCap == $lenCity + 1;
# put the individual letters of the city into an array
    @citychars = split //, $city;
    $capitalcopy = $capital;
# the following would be a more crude way to do it, leaving us results to be picked over by hand
    #$capitalcopy =~ s/[$city]//g;
# this is the heart of it - eat the characters of the capital away, one-by-one, with the characters of the city
# this provides for an exact, no-thinking solution
    foreach $citychar (@citychars) {
      $capitalcopy =~ s/$citychar//;
    }
    print "capitalcopy: $capitalcopy\n" if $DEBUG;
# solution occurs when the copy of the capital is left with exactly one character
    print "capital, city: $capital, $city\n" if $capitalcopy =~ /^\w$/;
  }
}

With the comments it’s pretty self-documenting.

I’ll publish my answer after the deadline. You have to do some work after all!

Python approach
A friend suggested a Python version. His draft didn’t quite run as written, so here is a cleaned-up sketch of the same drop-one-letter idea (my translation, not thoroughly tested against the real data), assuming the cleaned capitals and cities files created above:

from collections import Counter

capitals = [line.strip() for line in open("capitals")]
cities = [line.strip() for line in open("cities")]

for capital in capitals:
    for city in cities:
        # a candidate city must have exactly one letter fewer than the capital
        if len(capital) != len(city) + 1:
            continue
        # removing the city's letters from the capital, multiset-style,
        # must leave exactly one letter over
        if sum((Counter(capital) - Counter(city)).values()) == 1:
            print("%s -> %s" % (capital, city))

My answer
$ ./sol2.pl

capital, city: trenton, renton
capital, city: salem, mesa
capital, city: salem, ames

Actual answer
salem -> mesa
st paul -> tulsa

This illustrates a good lesson. Program, yes, but don’t turn off your brain. I was aware of the possible saint/st spelling pitfall, but I only applied it to the cities list. For some reason I overlooked applying this ambiguity to the capitals list! If I add stpaul to my capitals list the program finds tulsa instantly – I just tried it.

If there’s any satisfaction, very, very few people got the correct answer.

References and related
Audio clip of the puzzle
Another NPR puzzle amenable to simple computer tools, in this case shell commands, is documented here: puzzle
Raspberry Pis. Here we describe building a four-monitor display with Raspberry Pis.
A good python tutorial: https://docs.python.org/2/tutorial/

Categories
Admin Network Technologies Proxy Security TCP/IP Web Site Technologies

The IT Detective Agency: Cisco Jabber stopped working for some using WAN connections

Intro
This is probably the hardest case I’ve ever encountered. It’s so complicated many people needed to get involved to contribute to the solution.

Initial symptoms

It’s not easy to describe the problem while providing appropriate obfuscation. Over the course of a few days it came to light that in this particular large company for which I consult, many people in office locations connected via an MPLS network were no longer able to log in to Cisco Jabber. That’s Cisco’s offering for instant messaging. When it works, and is used in combination with Cisco IP phones, it’s pretty good – it has some nice features. This major problem was first reported November 17th.

Knee-jerk reactions
Networking problem? No. Network guys say their networks are running fine. They may be a tad overloaded but they are planning to route Internet over the secondary links so all will be good in a few days.
Proxy problem? Nope. Proxy guys say their Bluecoat appliances are running fine, and besides, everyone else is working.
Application problem? Application owner doesn’t see anything out of the ordinary.
Desktop problem? Maybe but it’s unclear.

Methodology
So of the 50+ users affected I recognized two power users whom I knew personally and focused on them. Over the course of days I learned:
– problem only occurs for WAN (MPLS) users
– problem only occurs when using one particular proxy
– if a user tries to connect often enough, they may eventually get in
– users can get in if they use their VPN client
– users at HQ were not affected

The application owner helpfully pointed out the URL for the web-based version of Cisco Jabber: https://loginp.webexconnect.com/… Anyone with the problem also could not log in to this site.

So working with these power users who patiently put up with many test suggestions we learned:

– setting the PC’s MTU to a small value, say 512 up to 696 made it work. Higher than that it generally failed.
– yet pings of up to 1500 bytes went through OK.
– the trace from one guy’s PC showed all his packets re-transmitted. We still don’t understand that.
– It’s a mess of communications to try to understand these modern, encrypted applications
– even the simplest trace contained over 1000 lines which is tough when you don’t know what you’re looking for!
– the helpful networking guy from the telecom company – let’s call him “Regal” – worked with us but all the while declaring how it’s impossible that it’s a networking issue
– proxy logs didn’t show any particular problem, but then again they cannot look into SSL communication since it is encrypted
– disabling Kaspersky helped some people but not others
– a PC with the problem had no problem when put onto the Internet directly
– if one proxy associated with the problem forwarded the requests to another, then it begins to work
– Is the problem reproducible? Yes, about 99% of the time.
– Do other web sites work from this PC? Yes.

From previous posts you will know that at some point I will treat every problem as a potential networking problem and insist on a trace.

Biases going in
So my philosophy of problem solving which had stood the test of time is either it’s a networking problem, or it’s a problem on the PC. Best is if there’s a competition of ideas in debugging so that the PC/application people seek to prove beyond a doubt it is a networking problem and the networking people likewise try to prove problem occurs on the PC. Only later did I realize the bias in this approach and that a third possibility existed.

So I enthused: what we need is a non-company PC – preferably on the same hardware – at the same IP address, to see if there’s a problem. Well, we couldn’t quite produce that, but one power user suggested using a VM. He just happened to have a VM environment on his PC and could spin up a generic Windows 7 Professional image! So we do that – it shows the problem. But at least the trace from it is a lot cleaner without all the overhead of the company packages’ communication.

The hard work
So we do the heavy lifting and take a trace on both his VM with the problem and the proxy server, and sit down to compare the two. My hope was to find a dropped packet, blame the network and let those guys figure it out. And I found it. After the client hello (part of the initial SSL handshake) the server responds with its server hello. That packet – a largish packet of 1414 bytes – was not coming through to the client! It gets re-transmitted multiple times and none of the re-transmits gets through to the PC. Instead the PC receives a packet the proxy never sent, indicating a fatal SSL error has occurred.

So I tell Regal that look there’s a problem with these packets. Meanwhile Regal has just gotten a new PC and doesn’t even have Wireshark. Can you imagine such a world? It seems all he really has is his tongue and the ability to read a few emails. And he’s not convinced! He reasons after all that the network has no intelligent, application-level devices and certainly wouldn’t single out Jabber communication to be dropped while keeping everything else. I am no desktop expert so I admit that maybe some application on the PC could have done this to the packets, in effect admitting that packets could be intercepted and altered by the PC even before being recorded by Wireshark. After all I repeated this mantra many times throughout:

This explanation xyz is unlikely, and in fact any explanation we can conceive of is unlikely, yet one of them will prove to be correct in the end.

Meanwhile the problem wasn’t going away, so I kludged their proxy PAC file to send everyone using Jabber to the one proxy where it worked for all.

So what we really needed was to create a span port on the switch where the PC was plugged in and connect a 2nd PC, in promiscuous mode, to a port receiving that mirrored traffic. That’s quite a lot of setup, and we were almost there when our power user began to work, so we couldn’t reproduce the problem. That was about Dec 1st. Then, a day later, our 2nd power user could no longer reproduce the problem either.

10,000 foot view
What we had so far is a whole bunch of contradictory evidence. Network? Desktop? We still could not say due to the contradictions, the likes of which I’ve never witnessed.

Affiliates affected and find the problem
Meanwhile an affiliate began to see the problem and independently examined it. They made much faster progress than we did. Within a day they found the reason (suggested by their networking person from the telecom, who apparently is much better than ours): the server hello packet has the expedited forwarding (EF) flag set in the differentiated services code point (DSCP) section of the IP header.

Say what?
So I really got schooled on this one. I was saying It has to be an application-aware “something” on the network or PC that is purposefully messing around with the SSL communication. That’s what the evidence screamed to me. So a PC-based firewall seemed a strong contender and that is how Regal was thinking.

So the affiliate explained it this way: the company uses QOS on their routers. Phone (VOIP) traffic gets priority and is the only application where the EF bit is expected to be set. VOIP packets are small, by the way. Regular applications like web sites should just use the default QOS. And according to Wikipedia, many organizations that do use QOS will impose thresholds on EF packets such that if the traffic exceeds, say, 30% of link capacity, they drop all packets with EF set that are over a certain size. OK, maybe it doesn’t say exactly that, but that is what I’ve come to understand happens. Which makes the dropping of these particular packets the correct behaviour as per the company’s own WAN contract and design. Imagine that!

Smoking gun no more
So now my smoking gun – blame it on the network for dropped packets – is turned on its head. Cisco has set this EF bit on the server hello response of the loginp.webexconnect.com web site. This is undesirable behaviour. It’s not a phone call, after all, which requires minimal jitter in packet timing.

So the next time I did a trace I found that instead of the EF flag being set, the AF (Assured Forwarding) flag was set. I suppose that makes handling more forgiving inside the company’s network, but I was told that even that was too much. Only the default value of 0 should be set for the DSCP value. This is an open issue in Cisco’s hands now.
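
Incidentally, for anyone wanting to spot this marking in a packet trace themselves: DSCP occupies the top six bits of the old ToS byte, so a verbose tcpdump shows EF (decimal 46) as tos 0xb8. A quick sanity check of that arithmetic:

```shell
# tos byte = DSCP value shifted left past the 2 ECN bits
dscp=46                          # EF, Expedited Forwarding
printf 'EF -> tos 0x%x\n' $(( dscp << 2 ))
# prints: EF -> tos 0xb8
```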

But at least this explains most observations.

Small MTU worked? Yup, those packets are looked upon more favorably by the routers.

One proxy worked, the other did not? Yup, they are in different data centers which have different bandwidth utilization. The one where it was not working has higher utilization.

Only affected users are at WAN sites? Yup, probably only the WAN routers are enforcing QOS.

Worked over VPN, even on a PC showing the problem? Yup – all VPN users use a LAN connection for their proxy settings.

Fabricated SSL fatal error packet? I’m still not sure about that one – guess the router sent it as a courtesy after it decided to drop the server hello – just a guess.

Problem fixed by shutting down Kaspersky? Nope, guess that was a red herring. Every problem has dead ends and red herrings, just a fact of life. And anyway that behaviour was not very consistent.

Problem started November 17th? Yup, the affiliate just happened to have a baseline packet trace from November 2nd which showed that DSCP was not in use at that time. So Cisco definitely changed the behaviour of Cisco Jabber sometime in the intervening weeks.

Other web sites worked, except this one? Yup, other web sites do not use the DSCP section of the IP header so it has the default value of 0.

Conclusion
Cisco has decided to remove the DSCP flag from these packets, which will fix everything. Perhaps EF was introduced in support of Cisco Jabber’s extended use as a soft phone? Then this company may have some re-design of their QOS to take care of, because I don’t see an easy solution. Dropping the MTU on the proxy to 512 seems pretty drastic and inefficient, though it would be possible. My understanding is that nothing prevents QOS markings from being set on any sort of packet, even though there may be a gentleman’s agreement to not ordinarily do so in anything except VOIP packets or a few other special classes. I don’t know. I’ve really never looked at QOS before this problem came along.

The company is wisely looking for a way to set all packets to DSCP = 0 on the Intranet, except of course those like VOIP where it is explicitly supposed to be used. This will be done on the Internet router. In Cisco IOS it is possible with a policy map and a police setting where you can specify set-dscp-transmit default. Apparently VPN and other things that may check the integrity of packets won’t mind the DSCP value being altered – it can happen anywhere along the route of the packet.

Boy applications these days are complicated! And those rare times when they go wrong really require a bunch of cooperating experts to figure things out. No one person holds all the expertise any longer.

My simplistic paradigm of “it’s either the PC or the network” had to make room for a new reality: it’s the web site in the cloud that did them in.

Could other web sites be similarly affected? Yes it certainly seems a possibility. So I now know to check for use of DSCP if a particular web site is not working, but all others are.

References and related
This Wikipedia article is a good description of DSCP: https://en.wikipedia.org/wiki/Differentiated_services

Categories
Admin CentOS Linux Security

The IT Detective Agency: WordPress login failure leads to discovery of ssh brute force attack

Intro
Yes, my WordPress instance never gave any problems for years. Then one day my usual username/password wouldn’t log me in! One thing led to another until I realized I was under an ssh brute force attack from Hong Kong. Then I implemented software that really helped the situation…

The details
Login failure

So like anyone would do, I double-checked that I was typing the password correctly. Once I convinced myself of that I went to an ssh session I had open to it. When all else fails restart, right? Except this is not Windows (I run CentOS) so there’s no real need to restart the server. There very rarely is.

Mysql fails to start
So I restarted mysql and the web server. I noticed mysql database wasn’t actually starting up. It couldn’t create a PID file or something – no space left on device.

No space on /
What? I never had that problem before. In an enterprise environment I’d have disk monitors and all that good stuff, but as a singleton user of Amazon AWS I suppose they could monitor and alert me to disk problems – though they’d probably want to charge me for the privilege. So yeah, a df -k showed 0 bytes available on /. That’s never a good thing.

/var/log very large
So I ran a du -k from / and sent the output to /tmp/du-k so I could preview at my leisure. Fail. Nope, can’t do that because I can’t write to /tmp because it’s on the / partition in my simple-minded server configuration! OK. Just run du -k and scan results by eye… I see /var/log consumes about 3 GB out of 6 GB available which is more than I expected.
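
A variation on the same idea avoids writing anything to the full partition at all: pipe du straight into sort so the biggest directories float to the top. The directory argument here is just an example:

```shell
# Largest directories first, sizes in KB; errors from unreadable
# directories are discarded
du -k /var 2>/dev/null | sort -rn | head -10
```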

btmp is way too large
So I did an ls -l in /var/log and saw that btmp alone is 1.9 GB in size. What the heck is btmp? Some searches show it to be a log used to record failed ssh login attempts. What is it recording?

Disturbing contents of btmp
I learned that to read btmp you do a
> last -f btmp
The output is zillions of lines like these:

root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
root     ssh:notty    43.229.53.13     Mon Oct 26 14:56 - 14:56  (00:00)
...

I estimate roughly 3.7 login attempts per second. And it’s endless. So I consider it a brute force attack to gain root on my server. This estimate is based on extrapolating from a 10-minute interval by doing one of these:

> last -f btmp|grep 'Oct 26 14:5'|wc

and dividing the result by 10 min * 60 s/min.

First approach to stop it
I’m a networking guy at heart, and remember that when you have a hammer all problems look like nails 😉 ? What is the network nail in this case? The attacker’s IP address of course. We can just make sure packets originating from that IP can’t get returned from my server, by doing one of these:

> route add -host 43.229.53.13 gw 127.0.0.1

Check it with one of these:

> netstat -rn

Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
43.229.53.13    127.0.0.1       255.255.255.255 UGH       0 0          0 lo
10.185.21.64    0.0.0.0         255.255.255.192 U         0 0          0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth0
0.0.0.0         10.185.21.65    0.0.0.0         UG        0 0          0 eth0

Then watch the btmp grow silent since now your server sends the reply packets to its loopback interface where they die.
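
The recipe can be scripted, too. Here is a hedged sketch of the idea – the sample lines stand in for real last -f btmp output, the field positions assume its default format, and the final route command would of course need root to actually run:

```shell
# Find the most prolific source IP and print the blackhole command for it
sample="root ssh:notty 43.229.53.13
root ssh:notty 43.229.53.13
admin ssh:notty 5.8.66.78"
ip=$(echo "$sample" | awk '{print $3}' \
      | sort | uniq -c | sort -rn | head -1 | awk '{print $2}')
echo "route add -host $ip gw 127.0.0.1"
# prints: route add -host 43.229.53.13 gw 127.0.0.1
```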

Short-lived satisfaction
But the pleasure and pats on your back will be short-lived as a new attack from a new IP will commence within the hour. And you can squelch that one, too, but it gets tiresome as you stay up all night keeping up with things.

Although it wouldn’t be too difficult to script the recipe above and automate it, I decided it might be easier still to find a package out there that does the job for me. And I did. It’s called

fail2ban

You can get it from the EPEL repository of CentOS, making it particularly easy to install. Something like:

$ yum install fail2ban

will do the trick.

I like fail2ban because it has the feel of a modern package. It’s written in python for instance and it is still maintained by its author. There are zillions of options which make it daunting at first.

To stop these ssh attacks in their tracks all you need is to create a jail.local file in /etc/fail2ban. Mine looks like this:

# DrJ - enable sshd monitoring
[DEFAULT]
bantime = 3600
# exempt CenturyLink
ignoreip = 76.6.0.0/16  71.48.0.0/16
#
[sshd]
enabled = true

Then reload it:

$ service fail2ban reload

and check it:

$ service fail2ban status

fail2ban-server (pid  28459) is running...
Status
|- Number of jail:      1
`- Jail list:   sshd

And most sweetly of all, wait a day or two and appreciate the marked change in the contents of btmp or secure:

support  ssh:notty    117.4.240.22     Mon Nov  2 07:05    gone - no logout
support  ssh:notty    117.4.240.22     Mon Nov  2 07:05 - 07:05  (00:00)
dff      ssh:notty    62.232.207.210   Mon Nov  2 03:38 - 07:05  (03:26)
dff      ssh:notty    62.232.207.210   Mon Nov  2 03:38 - 03:38  (00:00)
zhangyan ssh:notty    62.232.207.210   Mon Nov  2 03:38 - 03:38  (00:00)
zhangyan ssh:notty    62.232.207.210   Mon Nov  2 03:38 - 03:38  (00:00)
support  ssh:notty    117.4.240.22     Sun Nov  1 22:47 - 03:38  (04:50)
support  ssh:notty    117.4.240.22     Sun Nov  1 22:47 - 22:47  (00:00)
oracle   ssh:notty    180.210.201.106  Sun Nov  1 20:44 - 22:47  (02:03)
oracle   ssh:notty    180.210.201.106  Sun Nov  1 20:44 - 20:44  (00:00)
a        ssh:notty    180.210.201.106  Sun Nov  1 20:44 - 20:44  (00:00)
a        ssh:notty    180.210.201.106  Sun Nov  1 20:44 - 20:44  (00:00)
openerp  ssh:notty    123.212.42.241   Sun Nov  1 20:40 - 20:44  (00:04)
openerp  ssh:notty    123.212.42.241   Sun Nov  1 20:40 - 20:40  (00:00)
dff      ssh:notty    187.210.58.215   Sun Nov  1 20:36 - 20:40  (00:04)
dff      ssh:notty    187.210.58.215   Sun Nov  1 20:36 - 20:36  (00:00)
zhangyan ssh:notty    187.210.58.215   Sun Nov  1 20:36 - 20:36  (00:00)
zhangyan ssh:notty    187.210.58.215   Sun Nov  1 20:35 - 20:36  (00:00)
root     ssh:notty    82.138.1.118     Sun Nov  1 19:57 - 20:35  (00:38)
root     ssh:notty    82.138.1.118     Sun Nov  1 19:49 - 19:57  (00:08)
root     ssh:notty    82.138.1.118     Sun Nov  1 19:49 - 19:49  (00:00)
root     ssh:notty    82.138.1.118     Sun Nov  1 19:49 - 19:49  (00:00)
PlcmSpIp ssh:notty    82.138.1.118     Sun Nov  1 18:42 - 19:49  (01:06)
PlcmSpIp ssh:notty    82.138.1.118     Sun Nov  1 18:42 - 18:42  (00:00)
oracle   ssh:notty    82.138.1.118     Sun Nov  1 18:34 - 18:42  (00:08)
oracle   ssh:notty    82.138.1.118     Sun Nov  1 18:34 - 18:34  (00:00)
karaf    ssh:notty    82.138.1.118     Sun Nov  1 18:18 - 18:34  (00:16)
karaf    ssh:notty    82.138.1.118     Sun Nov  1 18:18 - 18:18  (00:00)
vagrant  ssh:notty    82.138.1.118     Sun Nov  1 17:13 - 18:18  (01:04)
vagrant  ssh:notty    82.138.1.118     Sun Nov  1 17:13 - 17:13  (00:00)
ubnt     ssh:notty    82.138.1.118     Sun Nov  1 17:05 - 17:13  (00:08)
ubnt     ssh:notty    82.138.1.118     Sun Nov  1 17:05 - 17:05  (00:00)
...

The attacks still come, yes, but they are so quickly snuffed out that there is almost no chance of correctly guessing a password – unless the attacker has a couple centuries on their hands!
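
Some rough arithmetic behind that, assuming fail2ban’s default maxretry of 5 together with the 1-hour bantime configured above – about 120 guesses a day – against even a feeble 6-character all-lowercase password:

```shell
space=$(( 26 ** 6 ))             # ~309 million 6-letter combinations
per_day=$(( 5 * 24 ))            # 5 tries per 1-hour ban, 24 bans/day
echo "$(( space / per_day / 365 )) years to exhaust the space"
# prints: 7052 years to exhaust the space
```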

Augment fail2ban with a network nail

Now in my case I had noticed attacks coming from various IPs around 43.229.53.13, and I’m still kind of disturbed by that, even after fail2ban was implemented. Who is that? Arin.net said that range is handled by apnic, the Asia Pacific NIC. apnic’s whois (apnic.net) says it is a building in the Mong Kok district of Hong Kong. Now I’ve been to Hong Kong and the Mong Kok district. It’s very expensive real estate and I think the people who own that subnet have better ways to earn money than trying to pwn AWS servers. So I think probably mainland hackers have a backdoor to this Hong Kong network and are using it as their playground. Just a wild guess. So anyhow I augmented fail2ban with a network route to prevent all such attacks from that network:

$ route add -net 43.229.0.0/16 gw 127.0.0.1

A few words on fail2ban

How does fail2ban actually work? It manipulates the local firewall, iptables, as needed. So it will activate iptables if you aren’t already running it. Right now my iptables looks clean so I guess fail2ban hasn’t found anything recently to object to:

$ iptables -L

Chain INPUT (policy ACCEPT)
target     prot opt source               destination
f2b-sshd   tcp  --  anywhere             anywhere            multiport dports ssh
 
Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
 
Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
 
Chain f2b-sshd (1 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere

Indeed, checking my messages file the most recent ban was over an hour ago – in the early morning:

Nov  2 03:38:49 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 62.232.207.210

And here is fail2ban doing its job since the log files were rotated at the beginning of the month:

$ cd /var/log; grep Ban messages

Nov  1 04:56:19 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 185.61.136.43
Nov  1 05:49:21 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 5.8.66.78
Nov  1 11:27:53 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 61.147.103.184
Nov  1 11:32:51 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 118.69.135.24
Nov  1 16:57:05 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 162.246.16.55
Nov  1 17:13:17 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 82.138.1.118
Nov  1 18:42:36 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 82.138.1.118
Nov  1 19:57:55 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 82.138.1.118
Nov  1 20:36:05 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 187.210.58.215
Nov  1 20:44:17 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 180.210.201.106
Nov  2 03:38:49 ip-10-185-21-116 fail2ban.actions[28459]: NOTICE [sshd] Ban 62.232.207.210

Almost forgot to mention
How did I free up space so I could still examine btmp? I deleted an older large log file, secure-20151011 which was about 400 MB. No reboot necessary of course. Mysql restarted successfully as did the web servers and I was back in business logging in to my WP site.

August 2017 update
I finally had to reboot my AWS instance after more than three years. I thought about my ssh usage pattern and decided it was really predictable: I either ssh from home or work, both of which have known IPs. And I’m simply tired of seeing all the hack attacks against my server. And I got better with the AWS console out of necessity.
Put it all together and you get a better way to deal with the ssh logins: simply block ssh (tcp port 22) with an AWS security group rule, except from my home and work.

Conclusion
The mystery of the failed WordPress login is examined in great detail here. The case was cracked wide open and the trails that were followed led to discovery of a brute force attempt to gain root access to the hosting server. Methods were implemented to ward off these attacks. An older log file was deleted from /var/log and mysql restarted. WordPress logins are working once again.

References and related info
fail2ban is documented in a wiki of uneven quality at www.fail2ban.org.
Another tool is DenyHosts. One of the ideas behind DenyHosts – its capability to share data – sounds great, but look at this page: http://stats.denyhosts.net/stats.html. “Today’s data” is date-stamped July 11, 2011 – four years ago! So it looks like development was suddenly abandoned four years ago – never a good sign for a security tool.

Categories
Admin

SD Card reader not working after Windows 10 upgrade

Intro
I was more than a little alarmed when the built-in SD card reader on my Dell Inspiron stopped working properly after I upgraded from Windows 7 to Windows 10. After the upgrade I inserted an SD card into the reader and nothing happened in File Explorer! This led to some tense moments.

The details
Here’s file Explorer after inserting the SD card:

File Explorer

The DVD drive is nowhere to be found, and the same goes for the SD card.

But if I right-click on This PC and select manage it looks like this:

Disk Management

So that would make it seem that Disk 1 is removable media mapped to the E: drive and my DVD player is mapped to the D: drive. Interesting. So let’s try this in File Explorer (known as Windows Explorer in previous versions of Windows). Type

E:

in the field where it says Quick access. Sure enough it magically appears:

Now with E: Drive mapped

And I can do the normal File Explorer operations with it.

I think there is a more permanent fix but for me I have no problem typing e: the few times I need to read an SD card.

Oh, and the DVD drive? It was there all along. I see it when I highlight This PC:

Now with DVD drive

Conclusion
If you don’t see your SD card when running Windows 10 don’t panic. It may be there alright. Type E: in the Quick Access field. Or maybe D: or F: – depends on your PC’s configuration, which I’ve shown how to list above. I believe a more permanent fix involves re-installing or repairing a driver, but I haven’t had time to look into it. My approach will get you working quickly in a pinch, like, say, when you have to get the photos off your camera’s SD card because you need them right now.

References and related articles
This Microsoft Technet discussion was helpful to me. It was slow to load however.

Categories
Admin Apache Hosting Service IT Operational Excellence Linux Web Site Technologies

Scaling your apache to handle more requests

Intro
I was running an apache instance very happily with mostly default options until the day came that I noticed it was taking seconds to serve a simple web page – one that it used to serve in 50 ms or so. I eventually rolled up my sleeves to see what could be done about it. It seems that what had changed is that it was being asked to handle more requests than ever before.

The details
But the load average on a 16-core server was only at 2! sar showed no particular problems with either the cpu or I/O subsystems. Both showed plenty of spare capacity. A process count showed about 258 apache processes running.

An Internet search helped me pinpoint the problem. Now bear in mind I use a version of apache I myself compiled, so the file layout looks different from the system-supplied apache, but the ideas are the same. What you need is to increase the number of allowed processes. On my server with its great capacity I scaled up considerably. These settings are in /conf/extra/httpd-mpm.conf in the compiled version. In the system-supplied version on SLES I found the equivalent to be /etc/apache2/server-tuning.conf. To begin with the key section of that file had these values:

<IfModule mpm_prefork_module>
    StartServers             5
    MinSpareServers          5
    MaxSpareServers         10
    MaxRequestWorkers      250
    MaxConnectionsPerChild   0
</IfModule>

(The correct section is <IfModule prefork.c> in the system-supplied apache).

I replaced these as follows:

<IfModule mpm_prefork_module>
    StartServers          256
    MinSpareServers        16
    MaxSpareServers       128
    ServerLimit          2048
    MaxClients           2048
    MaxRequestsPerChild  20000
</IfModule>

Note that ServerLimit has to be greater than or equal to MaxClients (thank you Apache developers!) or you get an error like this when you start apache:

WARNING: MaxClients of 2048 exceeds ServerLimit value of 256 servers,
 lowering MaxClients to 256.  To increase, please see the ServerLimit
 directive.

So you make this change, right, stop/start apache and what difference do you see? Probably none whatsoever! Because you probably forgot to uncomment this line in httpd.conf:

#Include conf/extra/httpd-mpm.conf

So remove the # at the beginning of that line and stop/start. If like me you’ve changed the usual directory where the PID file and lock file get written in your httpd.conf file, you may need this additional measure, which I had to do in the httpd-mpm.conf file:

<IfModule !mpm_netware_module>
    #PidFile "logs/httpd.pid"
</IfModule>
 
#
# The accept serialization lock file MUST BE STORED ON A LOCAL DISK.
#
<IfModule !mpm_winnt_module>
<IfModule !mpm_netware_module>
#LockFile "logs/accept.lock"
</IfModule>
</IfModule>

In other words I commented out this file’s attempt to place the PID and lock files in a certain place because I have my own way of storing those and it was overwriting my choices!

But with all those changes put together it works much, much better than before and can handle more requests than ever.

Analysis
In creating a simple benchmark we could easily scale to 400 requests / second, and we didn’t really even try to push it – and this was before we changed any parameters. So why couldn’t 250 or so simultaneous processes handle more real world requests? I believe that if all clients were as fast as our server it could have handled them all. But the clients themselves were sometimes distant (thousands of miles) with slow or lossy connections. They need to acknowledge every packet sent by the web server, and the web server has to wait around for that, unable to go on to the next client request! Real life is not like laboratory testing. Since the waiting around requires next-to-no cpu, the load average didn’t rise even though we had run up against a limit – but the limit was an artificial, application-imposed one, not a system-imposed resource constraint.
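
A bit of back-of-the-envelope arithmetic makes the ceiling concrete. The one-second hold time is an assumption for a distant, lossy client, not a measurement:

```shell
workers=250                      # roughly the old process limit
hold_ms=1000                     # assume a slow client ties up a worker ~1 s
echo "ceiling ~ $(( workers * 1000 / hold_ms )) requests/sec"
# prints: ceiling ~ 250 requests/sec
```

At a 50 ms LAN-speed service time the same 250 workers could in principle turn over 5,000 requests a second, which is why the benchmark showed so much headroom.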

More analysis, what about threads?

Is this the only or best way to scale up your web server? Probably not. It’s probably the most practical however because you probably didn’t compile it with support for threads. I know I didn’t. Or if you’re using the system-provided package it probably doesn’t support threads. Find your httpd binary. Run this command:

$ ./httpd -l|grep prefork

If it returns:

  prefork.c

you have the prefork module and not the worker module and the above approach is what you need to do. To me a more modern approach is to scale by using threads – modern cpus are designed to run threads, which are kind of like light-weight processes. But, oh well. The gatekeepers of apache packages seem stuck in this simple-minded one process per request mindset.
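
For comparison, had a threaded build been available, the analogous knobs live in the worker section of the same file. This fragment is purely illustrative – note that ServerLimit × ThreadsPerChild should cover MaxRequestWorkers:

```apache
<IfModule mpm_worker_module>
    StartServers            8
    ServerLimit            64
    ThreadsPerChild        32
    MaxRequestWorkers    2048
    MinSpareThreads        64
    MaxSpareThreads       256
</IfModule>
```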

Conclusion
My scaled-up apache is handling more requests than ever. I’ve documented how I increased the total process count.

References and related articles
How I compiled apache 2.4 and ran into (and resolved) a zillion errors seems to be a popular post!
The mystery of why we receive hundreds or even thousands of PAC file requests from each client every day remains unsolved to this day. That’s why we needed to scale up this apache instance – it is serving the PAC file. I first wrote about it three and a half years ago!

Categories
Admin Linux

Upgrading your JDBC driver for all you HP SiteScope fans

Intro
HP SiteScope is a pretty good and not overly pricey infrastructure monitoring solution. We’ve used it for years. An unexpected Oracle error sent us scrambling to remember how the heck we installed an Oracle JDBC driver on HP SiteScope the last time we did it, which was eons ago. As with many very specific yet important things on the Internet, the documentation available on the Internet was pretty spotty. Here is my attempt to remedy that. These instructions are for Redhat Linux, though I would think similar considerations would apply to the Windows version.

The details
Well all our Oracle database monitors were working just fine for years. So when asked to monitor a new database we simply copied one of the old ones and appropriately changed the connect string. But a strange thing happened. We got this error:

ORA-28040

ORA-28040: No matching authentication protocol

So we spoke with a DBA. This new database was running a much more current version, Oracle 12C. I became convinced that our several-years-old JDBC driver for SiteScope simply wasn’t compatible. The DBA searched the Oracle site and found supporting evidence for that hypothesis. So how to upgrade?

The latest JDBC Drivers can be found here on Oracle’s Website. We selected JDBC Driver 12c Release 1 (12.1.0.2) and downloaded the ojdbc7.jar file.

The thing is that to download it you need some kind of Oracle developer account. Fortunately I had one from years back and it still worked. So we were able to download it.

Where does it go?
The other breakthrough I had was simply to remember after thinking about it what the old jdbc driver was called. Its name wasn’t anything like ojdbc.jar. No, it was classes12.jar!

Of course memories can be tricked. To confirm that that jar file looked basically right we did a

$ jar tvf classes12.jar

Sure enough, there were a bunch of lines for oracle/jdbc/blah, blah. Then out of curiosity I tried to check the actual classpath of the SiteScope process with something like this:

$ ps -ef|grep java|grep classes12

and sure enough, it highlighted a java process – clearly belonging to HP SiteScope – and the classes12.jar therein.
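
That command-line inspection trick generalizes: break the ps output on spaces and colons so each classpath entry lands on its own line before grepping. The ps line below is fabricated for illustration:

```shell
# Split a java command line into tokens, one per line, then pick out
# the driver jar
psline="sitescope 1234 1 0 10:00 ? 00:01 java -cp /opt/SiteScope/WEB-INF/lib/classes12.jar Main"
echo "$psline" | tr ' :' '\n\n' | grep classes12
# prints: /opt/SiteScope/WEB-INF/lib/classes12.jar
```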

So memory confirmed.

Speculative next steps
This part is speculative and may not be necessary, though it doesn’t seem to hurt anything. I wanted to maximize my chance of success the first time rather than stopping/starting HP SiteScope multiple times, right? I didn’t see a quick way to tell HP SiteScope that, hey, the new driver to use is ojdbc7.jar, not classes12.jar, so I tried to force its hand. We moved the classes12.jar file out of its directory:

$ cd /opt/SiteScope/WEB-INF/lib; mv classes12.jar /tmp

and put the new jdbc file in that directory, and made a sym link from the old driver to the new one for good measure!

$ ln -s ojdbc7.jar classes12.jar

We tested if we could get away without stopping/restarting HP SiteScope. Nope. It didn’t pick up the new driver. So we were a little nervous. So we did the stop/start thing:

$ service hpss stop; service hpss start

It takes awhile, but…

Yes, the new monitor began working! Of course we were worried a bit about backwards compatibility between the 12C driver and the older version 11 databases, but those continued to work as well.

Conclusion
Installing a recent JDBC driver fixes the ORA-28040 error for our HP SiteScope installation. Was that sym link really necessary? I don’t know for sure, but I see that the java process still has classes12.jar in its path. It does not have ojdbc7.jar! There’s probably a way to modify the classpath, but I don’t know it. So in my case I’d be inclined to say Yes it was.

References and related articles
Oracle’s version 12C JDBC driver.
I rail against HP’s bureaucratic ways in this older posting.
My last HP SiteScope upgrade is documented here.