Categories
Admin Linux SLES

How to add private root CAs in SLES or Redhat

Intro
From time to time I run my own PKI, namely issuing my own certificates from my private root CA. I wanted this root CA to be recognized by Linux utilities running on Suse Linux (SLES), in particular lftp, which I was trying to use to access an ftps site. That itself is a post for another day.

The details
Let’s say you have your root certificate in the standard form like this example

-----BEGIN CERTIFICATE-----
MIIIPzCCBiegAwIBAgITfgAAAATHCoXJivwKLQAAAAAABDANBgkqhkiG9w0BAQsF
ADA2MQswCQYDVQQGEwJERTENMAsGA1UEChMEQkFTRjEYMBYGA1UEAxMPQkFTRiBS
b290IENBIDIxMB4XDTE3MDgxMDEyNDAwOFoXDTI4MDgxMDEyNTAwOFowXDETMBEG
Cgm
...
PEScyptUSAaGjS4JuxsNoL6URXYHxJsR0bPlet
Sct
-----END CERTIFICATE-----

Then you can put the certificate inline in a script and install it in one go, so that it permanently joins the other root CAs in /etc/ssl/certs. Here is an example script:

DrJ_Root_CA="-----BEGIN CERTIFICATE-----\nMIIIPzCCBiegAwIBAgITfgAAAATHCoXJivwKLQAAAAAABDANBgkqhkiG9w0BAQsF\nADA2MQswCQYD
VQQGEwJERTENMAsGA1UEChMEQkFTRjEYMBYGA1UEAxMPQkFTRiBS\nb290IENBIDIxMB4XDTE3MDgxMDEyNDAwOFoXDTI4MDgxMDEyNTAwOFowXDETMBEG\nCgm
SJomT8ixkARkWA05FVDEUMBIGCgmSJomT8ixkARkWBEJBU0YxFjAUBgoJkiaJ\nk/IsZAEZFgZCQVNGQUQxFzAVBgNVBAMTDkJBU0YgU1VCIENBIDIzMIICIjAN
Bgkq\nhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAqrfoKxrCPCw/u2PBEaAwW/VHLxBw6JNi\n42F3EhXmligGb/Uu4kcWO016IGFatVrPhdAtShAqmTXis0w57hW
jn1Iptvo7rROY\nGPmH7aSW/fYM/x2Lln7NlltayXspWawqBzWzYGADodyjn/Z5TaLYaG8lajiabCM5\nUJDhlZ/SUR3xylqIIFaQK3k2twjeGoxobhbr9hJcQZ
fXF0V5FCSCzJExDYma6bs1\nZtyqP/yHaiOeWXGdnqM9EPfT8kmIC42ZXq7s2JZI5OUflJBbaebYEbuDad6Rh19E\nRchXABLe68+TF/4AZCw16iRwRgq/2Re2W
WPMtVomyZ2txvn51iizqBkdVGzIRklC\n3yIv5MRzDFTfG940/tSAomHsz+RdGbL+NCBeWSY+rnJQdExJ7bLXFLVsTNGL68lP\nMuYrkxYQKWRtVhvQCHsdd5E0
t9QR4iY1JLWQxq3GHy98tBbCGiKMpBbuj/9I/E6c\nGrikouv2QyNnCN34PXpUxTQmDj5LZGV9w2faqpwUBD2ZWsbyVSgvD8TcjdxzcMcj\nLBnYUaZ8wHFqUj2
DBahctfKQxA8Ptrzt1mDIGOQliZGDwrTVMECd+noQhTlF1eS+\nvNraV3dYRMymVxh58MPEaDJgwIRcBWAAOeBbZlyx76oskXdmjOiz5jqyoR5eweCE\ntS4jfM
EW6UECAwEAAaOCAx4wggMaMAsGA1UdDwQEAwIBhjAQBgkrBgEEAYI3FQEE\nAwIBADAdBgNVHQ4EFgQUdn7nwFGpb8uzpFVs5QWQcsA0Q6IwQwYDVR0gBDwwOjA
4\nBgwrBgEEAYGlZAMCAgEwKDAmBggrBgEFBQcCARYaaHR0cDovL3BraXdlYi5iYXNm\nLmNvbS9jcAAwGQYJKwYBBAGCNxQCBAweCgBTAHUAYgBDAEEwEgYDVR
0TAQH/BAgw\nBgEB/wIBADAfBgNVHSMEGDAWgBSS9auUcX38rmNVmQsv6DKAMZcmXDCCAQkGA1Ud\nHwSCAQAwgf0wgfqggfeggfSGgbZsZGFwOi8vL0NOPUJBU
0YlMjBSb290JTIwQ0El\nMjAyMSxDTj1DRFAsQ049UHVibGljJTIwS2V5JTIwU2VydmljZXMsQ049U2Vydmlj\nZXMsQ049Q29uZmlndXJhdGlvbixEQz1yb290
LERDPWJhc2YsREM9Y29tP2NlcnRp\nZmljYXRlUmV2b2NhdGlvbkxpc3Q/YmFzZT9vYmplY3RDbGFzcz1jUkxEaXN0cmli\ndXRpb25Qb2ludIY5aHR0cDovL3B
raXdlYi5iYXNmLmNvbS9yb290Y2EyMS9CQVNG\nJTIwUm9vdCUyMENBJTIwMjEuY3JsMIIBNgYIKwYBBQUHAQEEggEoNIIBJDCBuQYI\nKwYBBQUHMAKGgaxsZG
FwOi8vL0NOPUJBU0YlMjBSb290JTIwQ0ElMjAyMSxDTj1B\nSUEsQ049UHVibGljJTIwS2V5JTIwU2VydmljZXMsQ049U2VydmljZXMsQ049Q29u\nZmlndXJhd
GlvbixEQz1yb290LERDPWJhc2YsREM9Y29tP2NBQ2VydGlmaWNhdGU/\nYmFzZT9vYmplY3RDbGFzcz1jZXJ0aWZpY2F0aW9uQXV0aG9yaXR5MGYGCCsGAQUF\n
BzAChlpodHRwOi8vcGtpd2ViLmJhc2YuY29tL3Jvb3RjYTIxL1JPT1RDQTIxLnJ6\nLWMwMDctajY1MC5iYXNmLWFnLmRlX0JBU0YlMjBSb290JTIwQ0ElMjAyM
S5jcnQw\nDQYJKoZIhvcNAQELBQADggIBAClCvn9sKo/gbrEygtUPsVy9cj9UOQ2/CciCdzpz\nXhuXfoCIICgc0YFzCajoXBLj4V6zcYKjz8RndaLabDaaSQgj
phXFiZSBH8OII+cp\nTCWW1x+JElJXo9HB7Ziva2PeuU5ajXtvql5PegFYWdmgK2Q1QH0J2f1rr7B4nNGu\noyBi1TOSll+0yJApjx213lM9obt6hkXkjeisjcq
auMVh+8KloM0LQOTAD1bDAvpa\nVVN9wlbytvf4tLxHpvrxEQEmVtTAdVchuQV1QCeIbqIxW41l6nhE2TlPwEmTr+Cv\najMID/ebnc9WzeweyTddb6DSmn4mSc
okGpj8j8Z7cw173Yomhg1tEEfEzip+/Jx6\nd2qblZ9BUih9sHE8rtUBEPLvBZwr2frkXzL3f8D6w36LxuhcqJOmDaIPDpJMH/65\nAbYnJyhwJeGUbrRm3zVtA
5QHIiSHi2gTdEw+9EfyIhuNKS4FO/uonjJJcKBtaufl\nGFL6y0WegbS5xlMV9RwkM22R7sQkBbDTr+79MqJXYCGtbyX0JxIgOGbE4mxvdDVh\nmuPo9IpRc5Jl
pSWUa7HvZUEuLnUicRbfrs1PK/FBF7aSrJLoYprHPgP6421pl08H\nhhJXE9XA2aIfEkJ4BcKw0BqOP/PEScyptUSAaGjS4JuxsNoL6URXYHxJsR0bPlet\nSct
3\n-----END CERTIFICATE-----\n"
 
cd /etc/pki/trust/anchors/
echo -e -n "$DrJ_Root_CA" > DrJ_Root_CA.pem
c_rehash
update-ca-certificates

So the key commands are c_rehash and update-ca-certificates.

Usually SLES is similar to Redhat. But it seems to be different in this case.

This was tested on a SLES 12 SP3 system.

The script writes the certificate to /etc/pki/trust/anchors, which by itself is insufficient. Then c_rehash and update-ca-certificates take over: they create hash-named symlinks to the CA file and make sure that this new certificate doesn’t get wiped out by subsequent system patching.
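
As a small variation on the inline approach, here is a sketch of a helper that expands the escaped one-line certificate string into a proper multi-line PEM file and checks the first and last lines before you go on to run c_rehash and update-ca-certificates. The name write_pem is my own invention for illustration, and it uses printf '%b' in place of echo -e -n, which should expand the \n sequences the same way.

```shell
# write_pem ESCAPED_PEM TARGET_FILE
# Expands the literal \n sequences in ESCAPED_PEM and writes the result to
# TARGET_FILE. Returns non-zero if the result lacks the PEM begin/end markers.
# (write_pem is a hypothetical helper name, not part of any distro tooling.)
write_pem() {
    escaped_pem=$1
    target_file=$2
    # %b makes printf interpret backslash escapes such as \n in the argument
    printf '%b' "$escaped_pem" > "$target_file" || return 1
    head -n 1 "$target_file" | grep -q '^-----BEGIN CERTIFICATE-----$' || return 1
    tail -n 1 "$target_file" | grep -q '^-----END CERTIFICATE-----$' || return 1
}

# Example (as root on SLES):
#   write_pem "$DrJ_Root_CA" /etc/pki/trust/anchors/DrJ_Root_CA.pem \
#       && c_rehash && update-ca-certificates
```

If the function returns non-zero, the variable was probably mangled when it was pasted in, and it is better to find that out before touching the trust store.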

You may also see these hashes and certificates in /etc/ssl/certs. I’m not sure because that’s where I started with all this. But merely dropping the private root CA into /etc/ssl/certs is insufficient, I can say from experience!

Redhat
Redhat is better documented, but for completeness I include it here. You have your inline certificate as in the SLES script, then following that:

...
cd /etc/pki/ca-trust/source/anchors/
echo -e -n "$DrJ_Root_CA" > DrJ_Root_CA.pem
update-ca-trust

So update-ca-trust is the key command for Redhat Linux. This was tested on Redhat Linux v 7.6.
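
If you want one script for both families, a minimal sketch could pick the anchors directory and the update command from the distribution ID. This assumes /etc/os-release exists and reports ID values such as sles or rhel, which I have not verified beyond the versions mentioned in this post. The function name ca_trust_settings is made up for the example.

```shell
# ca_trust_settings DISTRO_ID
# Prints "<anchors directory> <update command>" for the given os-release ID.
# The directory and command pairs are the ones used in this post.
ca_trust_settings() {
    case "$1" in
        sles|sled|opensuse*)
            echo "/etc/pki/trust/anchors update-ca-certificates" ;;
        rhel|centos|fedora)
            echo "/etc/pki/ca-trust/source/anchors update-ca-trust" ;;
        *)
            echo "unsupported distribution: $1" >&2
            return 1 ;;
    esac
}

# Example (as root):
#   . /etc/os-release                      # sets ID
#   set -- $(ca_trust_settings "$ID") && cp DrJ_Root_CA.pem "$1"/ && "$2"
```

On SLES I also ran c_rehash before the update command, so add that step there if you follow my script to the letter.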

lftp usage tip with a private CA
If like me you were doing this work in conjunction with running ftps using a certificate signed by a private CA, and want your ftp client, lftp, to not complain about the unrecognized CA, then this tip will help.

After starting lftp and sending the username and password, you can issue this command at the lftp prompt:
set ssl:ca-file <path-to-your-private-CA-file>
lftp is so flexible it offers many other ways to do this as well. But this is the one I use.
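
One of those other ways is to make the setting permanent in an lftp startup file. Here is a sketch that appends the setting to a startup file only if it is not already there. The function name add_lftp_ca is hypothetical, ~/.lftprc is assumed to be the startup file your lftp reads, and the CA path is assumed to contain no spaces.

```shell
# add_lftp_ca RC_FILE CA_FILE
# Appends "set ssl:ca-file CA_FILE" to RC_FILE unless that exact line is
# already present. Assumes CA_FILE contains no spaces.
add_lftp_ca() {
    rc_file=$1
    ca_file=$2
    setting="set ssl:ca-file $ca_file"
    grep -qxF -e "$setting" "$rc_file" 2>/dev/null \
        || printf '%s\n' "$setting" >> "$rc_file"
}

# Example:
#   add_lftp_ca "$HOME/.lftprc" /etc/pki/trust/anchors/DrJ_Root_CA.pem
```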

Conclusion
We show how to add your own root CA to a SLES 12 system. I did not find a good reference for this information anywhere on the Internet.

References and related
My favorite openssl commands.

The basics of working with cipher settings

For the Redhat/CentOS approach I used this blog post on the proper way to add your own private CA: https://www.happyassassin.net/2015/01/14/trusting-additional-cas-in-fedora-rhel-centos-dont-append-to-etcpkitlscertsca-bundle-crt-or-etcpkitlscert-pem/

Categories
Admin Linux Network Technologies SLES

Linux tip: how to enable remote syslog on SLES

Intro
I write this knowing I still don’t know anything to speak of about syslog, but, sometimes you gotta act without knowing. I needed to send syslog to somewhere in a big hurry so I figured out the absolute minimum I needed to do to get it running on one of my other systems.

The details
This all started because of a deficiency in the F5 ASM. At best it’s too slow when looking through the error log. But in particular there was one error that always timed out when I tried to bring up the details, a severity 5 error, so it looked pretty important. Worse, local logging, even though it is selected, also does not work – the /var/log/asm file exists but contains basically nothing of interest. I suppose there is some super-fancy and complicated MySQL command you could run to view the logs, but that would take a long time to figure out.

So for me the simplest route was to enable remote syslog on a Linux server and send the ASM logging to it. This seems to be working, by the way.

The minimal steps
Again, this was for Suse Enterprise Linux running syslog-ng.

  1. modify /etc/sysconfig/syslog as per the next step
  2. SYSLOGD_PARAMS="-r"
  3. modify /etc/syslog-ng/syslog-ng.conf as per the next step
  4. uncomment this line: udp(ip("0.0.0.0") port(514));
  5. launch yast (I use curses-based yast [no X-Windows] which is really cantankerous)
  6. go to Security and Users -> Firewall -> Allowed services -> Internal Zone -> Advanced
  7. add udp port 514 as additional allowed Ports in internal zone and save it
  8. service syslog stop
  9. service syslog start
  10. You should start seeing entries in /var/log/localmessages as in this suitably anonymized example (I added a couple of line breaks for clarity):
Jul 27 14:42:22 f5-drj-mgmt ASM:"7653503868885627313","50.17.188.196","/Common/drjohnstechtalk.com_profile","blocked","/drjcrm/bi/tjhmore345","0","Illegal URL,Attack signature detected","200021075","Automated client access ""curl""","US","<?xml version='1.0' encoding='UTF-8'?><BAD_MSG><violation_masks><block>44e7f1ffebff2dfb-8000000000000000</block><alarm>44f7f1ffebff2dfb-8000000000000000</alarm><learn>44e7f1ffe3ff2dfb-8000000000000000</learn><staging>0000000000000000-0000000000000000</staging></violation_masks><request-violations><violation><viol_index>42</viol_index><viol_name>VIOL_ATTACK_SIGNATURE</viol_name><context>request</context><sig_data><sig_id>200021075</sig_id>
<blocking_mask>7</blocking_mask><kw_data><buffer>VXNlci1BZ2VudDogY3VybC83LjE5LjcgKHg4Nl82NC1yZWRoYXQtbGludXgtZ251KSBsaWJjdXJsLzcuMTkuNyBOU1MvMy4yNy4xIHpsaWIvMS4yLjMgbGliaWRuLzEuMTggbGlic3NoMi8xLjQuMg0KSG9zdDogYWctaW50ZWw=</buffer>
<offset>0</offset><length>16</length></kw_data></sig_data></violation><violation><viol_index>38</viol_index>
<viol_name>VIOL_URL</viol_name></violation></request-violations></BAD_MSG>","GET /drjcrm/bi/tjhmore345 HTTP/1.1\r\nUser-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.27.1 zlib/1.2.3 libidn/1.18 libssh2/1.4.2\r\nHost: drjohnstechtalk.com\r\nAccept: */*\r\n\r\n"
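
If you don’t want to wait for the F5 to log something, a quick way to check the listener is to send a test message yourself from another Linux box. The sketch below relies on bash’s /dev/udp redirection, which only exists if your bash was built with network redirections, and the function names and the 192.0.2.10 address are placeholders of mine. The number in angle brackets is the syslog priority, computed as facility times 8 plus severity.

```shell
# syslog_pri FACILITY SEVERITY
# Prints the numeric priority used in the <PRI> prefix of a syslog message:
# facility * 8 + severity (for example local0 = 16, notice = 5).
syslog_pri() {
    echo $(( $1 * 8 + $2 ))
}

# send_test_syslog HOST [PORT]
# Sends one UDP test message (local0.notice) via bash's /dev/udp redirection.
send_test_syslog() {
    local host=$1 port=${2:-514}
    printf '<%d>%s' "$(syslog_pri 16 5)" \
        "test: remote syslog check from $(hostname)" > "/dev/udp/$host/$port"
}

# Example, run from another machine (192.0.2.10 is a placeholder address):
#   send_test_syslog 192.0.2.10
```

Which file the test message lands in on the server depends on the filters in your syslog-ng.conf, so check more than just localmessages if nothing shows up.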

Observations
Interestingly, there is no syslogd on this particular system, and yet the “-r” flag is designed for syslogd – it’s what turns it into a remote syslogging daemon. And yet it works.

It’s easy enough to log these messages to their own file. I just don’t know how to do it yet because I don’t need to. I learn as I need to, just as I learned enough to publish this tip.

Conclusion
We have demonstrated activating the simplest possible remote syslogger on Suse Linux Enterprise Server.

Categories
Raspberry Pi SLES Web Site Technologies

Pi-hole: it’s as easy as pi to get rid of your advertisements

Intro
I learned about pi-hole from Bloomberg Businessweek of all places. Seems right up my alley – uses Raspberry Pi in your home to get rid of advertisements. Turns out it was too easy and I don’t have much to contribute except my own experiences with it!

The details
When I read about it I got to thinking big picture and wondered what would prevent us from running an enterprise version of this same thing? Well, large enterprises don’t normally run production-critical applications like DNS servers (which this is, by the way) on Raspberry Pis, which are not the world’s most stable hardware! But first I had to try it at home just to learn more about the technology.

pi-hole admin screen

I was surprised just how optimized it was for the Raspberry Pi, to the neglect of other systems. So the idea of using an old SLES server is out the window.

But I think I got the essence of the idea. It replaces your DNS server with a custom one that resolves normal queries for web sites the usual way, but for DNS queries that would resolve to an Ad server, it clobbers the DNS and returns its own IP address. Why? So that it can send you a harmless blank image or whatever in place of an Internet ad.

You know those sites that obnoxiously throw up those auto-playing videos? That ain’t gonna happen any more when you run pi-hole.

You have to be a little adept at modifying your home router, but they even have a rough tutorial for that.

Installation
For the record, on my Raspberry Pi I only did this:
$ sudo su -
$ curl -sSL https://install.pi-hole.net | bash

It prompted me for a few configuration details, but the answers were obvious. I chose Google DNS servers because I have a long and positive history using them.

You can see that it installs a bunch of packages – surprisingly many considering how simple in theory the thing is.

Test it
On your Raspberry Pi do a few test resolutions:

$ dig google.com @localhost # should look like it normally does
$ dig pi.hole # should return the IP of your Raspberry Pi
$ dig adservices.google.com # I gotta check this one. Should return IP address of your Pi
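
To run the same kind of check over several names at once, something like this sketch could work. It assumes it runs on a machine that uses pi-hole as its resolver, that blocked names come back as the Pi’s own address as described in this post, and that dig is installed. The function names are mine.

```shell
# is_sinkholed ANSWER PI_IP
# Succeeds when the DNS answer is non-empty and equals the Pi's own address.
is_sinkholed() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# check_domains PI_IP DOMAIN...
# Resolves each domain with dig and reports whether pi-hole clobbered it.
check_domains() {
    pi_ip=$1
    shift
    for domain in "$@"; do
        answer=$(dig +short "$domain" | head -n 1)
        if is_sinkholed "$answer" "$pi_ip"; then
            echo "$domain: blocked"
        else
            echo "$domain: resolved to ${answer:-nothing}"
        fi
    done
}

# Example:
#   check_domains 192.168.1.119 google.com adservices.google.com
```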

It runs a little web server on your Pi so the Pi acts as adservices.google.com and just serves out some white space instead of the ad you would have gotten.

Linksys router
Another word about the home router DHCP settings. You have the option to enter a DNS server. So I put in the IP address of my Raspberry Pi, 192.168.1.119. What I expected is that this is the DNS server that would be directly handed out to the DHCP clients on my home network. But that is not the case. Instead it still hands out itself, 192.168.1.1, as DNS server. But in turn it uses the Raspberry Pi for its resolution. This threw me when I did an ipconfig /all on my Windows 10 and didn’t see the DNS server I expected. But it was all working. About 10% of my DNS queries were pi-holed (see picture of my admin screen above).

I guess pi-hole is run by fanatics, because it works surprisingly well. Those complex sites still worked, like cnn.com, cnet.com. But they probably load faster without the ads.

Two months check up

I checked back with pihole. I know a DNS server is running. The dashboard is broken – the sections just have a spinning circle instead of data. It’s already asking me to upgrade to v 3.3.1. I ran pihole -up to do the upgrade.

Another little advantage
I can now ssh to my pi by specifying the host as pi.hole – which I can actually remember!

Idea for enterprise
Finally, the essence of the idea probably could be ported over to an enterprise. In my opinion the secret sauce is the lists of domain names to clobber. There are five or six of them. Some have 50,000 entries. So you’d probably need a specialized DNS server rather than the default ISC BIND. I remember running a specialized DNS server like that when I ran Puremessage by Sophos. It was optimized to suck in real-time blacklists and the like. I have to dig through my notes to see what we ran. I’m sure it wasn’t dnsmasq, which is what pi-hole runs on the Raspberry Pi! But with these lists and some string manipulation and a simple web server I’d think it’d be possible to replicate in an enterprise environment. I may never get the opportunity, more for lack of time than for lack of ability…
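
To give a feel for the string manipulation part, here is a rough sketch that turns a hosts-format block list into dnsmasq address= lines pointing at a sink IP. I am assuming the lists are in hosts format (an IP followed by a name), which may not hold for every list, and this is not necessarily how pi-hole itself does it. The function name and the 10.1.2.3 address are placeholders.

```shell
# hosts_to_dnsmasq SINK_IP < hosts-format-list
# Turns lines like "0.0.0.0 ads.example.com" into
# "address=/ads.example.com/SINK_IP", skipping comments, blank lines
# and the localhost entry.
hosts_to_dnsmasq() {
    awk -v sink="$1" '
        { sub(/#.*/, "") }                  # strip comments
        NF >= 2 && $2 != "localhost" {
            print "address=/" tolower($2) "/" sink
        }'
}

# Example:
#   hosts_to_dnsmasq 10.1.2.3 < blocklist.txt > /etc/dnsmasq.d/blocklist.conf
```

Note that a dnsmasq address= line also matches subdomains of the name given, so a list converted this way blocks a bit more broadly than an exact-match list would.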

Conclusion
Looking for a rewarding project for your Raspberry Pi? Spare yourself Internet advertisements at home by putting it to work.

References and related
The pi-hole web site: https://pi-hole.net/
Another Raspberry Pi project idea: monitor your cable modem and restart it when it goes south.

Categories
Linux SLES Web Site Technologies

Compiling curl and openssl on Redhat Linux

Intro
I have an ancient Redhat system which I’m not in a position to upgrade. I like to use curl to test web sites, but it’s getting to the point that my ancient version has no SSL versions in common with some secure web sites. I desperately wanted to upgrade curl while leaving the rest of the system as is. Is it even possible? How would you do it? All these things and more are explained in today’s riveting blog post.

The details
Redhat version
I don’t know the proper command so I do this:
$ cat /etc/system-release

Red Hat Enterprise Linux Server release 6.6 (Santiago)

Current curl version
$ ./curl --version

curl 7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.16.2.3 Basic ECC zlib/1.2.3 libidn/1.18 libssh2/1.4.2

Limited set of SSL/TLS protocols
$ curl --help

...
 -2/--sslv2         Use SSLv2 (SSL)
 -3/--sslv3         Use SSLv3 (SSL)
...
 -z/--time-cond <time> Transfer based on a time condition
 -1/--tlsv1         Use TLSv1 (SSL)
...

New version of curl

curl 7.55.1 (x86_64-unknown-linux-gnu) libcurl/7.55.1 OpenSSL/1.1.0f zlib/1.2.3

New SSL options

     --ssl           Try SSL/TLS
     --ssl-allow-beast Allow security flaw to improve interop
     --ssl-no-revoke Disable cert revocation checks (WinSSL)
     --ssl-reqd      Require SSL/TLS
 -2, --sslv2         Use SSLv2
 -3, --sslv3         Use SSLv3
...
     --tls-max <VERSION> Use TLSv1.0 or greater
     --tlsauthtype <type> TLS authentication type
     --tlspassword   TLS password
     --tlsuser <name> TLS user name
 -1, --tlsv1         Use TLSv1.0 or greater
     --tlsv1.0       Use TLSv1.0
     --tlsv1.1       Use TLSv1.1
     --tlsv1.2       Use TLSv1.2
     --tlsv1.3       Use TLSv1.3

Now that’s an upgrade! How did we get to this point?

Well, I tried to get a curl RPM – seems like the appropriate path for a lazy system administrator, right? Well, not so fast. It’s not hard to find an RPM, but trying to install one showed a lot of missing dependencies, as in this example:
$ sudo rpm -i curl-minimal-7.55.1-2.0.cf.fc27.x86_64.rpm

warning: curl-minimal-7.55.1-2.0.cf.fc27.x86_64.rpm: Header V4 DSA/SHA1 Signature, key ID b56a8bac: NOKEY
error: Failed dependencies:
        libc.so.6(GLIBC_2.14)(64bit) is needed by curl-minimal-7.55.1-2.0.cf.fc27.x86_64
        libc.so.6(GLIBC_2.17)(64bit) is needed by curl-minimal-7.55.1-2.0.cf.fc27.x86_64
        libcrypto.so.1.1()(64bit) is needed by curl-minimal-7.55.1-2.0.cf.fc27.x86_64
        libcurl(x86-64) >= 7.55.1-2.0.cf.fc27 is needed by curl-minimal-7.55.1-2.0.cf.fc27.x86_64
        libssl.so.1.1()(64bit) is needed by curl-minimal-7.55.1-2.0.cf.fc27.x86_64
        curl conflicts with curl-minimal-7.55.1-2.0.cf.fc27.x86_64

So I looked at the libcurl RPM, but it had its own set of dependencies. Pretty soon it looks like a full-time job to get this thing compiled!

I found the instructions mentioned in the reference, but they didn’t work for me exactly like that. Besides, I don’t have a working git program. So here’s what I did.

Compiling openssl

I downloaded the latest openssl, 1.1.0f, from https://www.openssl.org/source/ , untarred it, went into the openssl-1.1.0f directory, and then:

$ ./config -Wl,--enable-new-dtags --prefix=/usr/local/ssl --openssldir=/usr/local/ssl
$ make depend
$ make
$ sudo make install

So far so good.

Compiling zlib
For zlib I was lazy and mostly followed the other guy’s commands. Went something like this:
$ lib=zlib-1.2.11
$ wget http://zlib.net/$lib.tar.gz
$ tar xzvf $lib.tar.gz
$ mv $lib zlib
$ cd zlib
$ ./configure
$ make
$ cd ..
$ CD=$(pwd)

No problems there…

Compiling curl
curl was tricky and when I followed the guy’s instructions I got the very problem he sought to avoid.

vtls/openssl.c: In function ‘Curl_ossl_seed’:
vtls/openssl.c:276: error: implicit declaration of function ‘RAND_egd’
make[2]: *** [libcurl_la-openssl.lo] Error 1
make[2]: Leaving directory `/usr/local/src/curl/curl-7.55.1/lib'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/usr/local/src/curl/curl-7.55.1/lib'
make: *** [all-recursive] Error 1

I looked at the source and decided that what might help is to add a hint where the openssl stuff could be found.

Backing up a bit, I got the source from https://curl.haxx.se/download.html. I chose the file curl-7.55.1.tar.gz. Untar it, go into the curl-7.55.1 directory,
$ ./buildconf
$ export PKG_CONFIG_PATH=/usr/local/ssl/lib/pkgconfig LIBS="-ldl"

and then – here is the single most important point in the whole blog – configure it thusly:

$ ./configure --with-zlib=$CD/zlib --disable-shared --with-ssl=/usr/local/ssl

So my insight was to add the --with-ssl=/usr/local/ssl to the configure command.

Then of course you make it:

$ make

and maybe even install it:

$ make install

This put curl into /usr/local/bin. I actually made a sym link and made this the default version with this kludge (the following commands were run as root):

$ cd /usr/bin; mv curl{,.orig}; ln -s /usr/local/bin/curl

That’s it! That worked and produced a working, modern curl.

By the way it mentions TLS1.3, but when you try to use it:

$ curl -i -k --tlsv1.3 https://drjohnstechtalk.com/

curl: (4) OpenSSL was built without TLS 1.3 support

It’s a no go. But at least TLS1.2 works just fine in this version.
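
Which TLS versions a given curl binary can really speak depends on the library it was built against, not on what the help text lists. A small sketch to pull that library out of the version banner follows. The function name is made up, and it only looks for the OpenSSL, NSS and GnuTLS tokens.

```shell
# curl_tls_backend: reads `curl --version` output on stdin and prints the
# TLS library token from the first line (e.g. OpenSSL/1.1.0f or NSS/3.16.2.3).
# Prints nothing if none of the known tokens is found.
curl_tls_backend() {
    awk 'NR == 1 {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^(OpenSSL|NSS|GnuTLS)\//) { print $i; exit }
    }'
}

# Example:
#   /usr/local/bin/curl --version | curl_tls_backend
```

Run it against both the old and the new binary to be sure which one you are actually invoking after the symlink kludge.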

One other thing – put shared libraries in a common area
I copied my compiled curl from Redhat to a SLES 11 SP 3 system. It didn’t quite run. Only thing is, it was missing the openssl libraries. So I guess it’s also important to copy over

libssl.so.1.1
libcrypto.so.1.1

to /usr/lib64 from /usr/local/lib64.

Once I did that, it worked like a charm!
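
A quick way to see which libraries are missing after copying a binary to another system is to filter the output of ldd. This is only a sketch, and the function name is mine:

```shell
# missing_libs: reads `ldd` output on stdin and prints the names of shared
# libraries that the dynamic linker could not find.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Example:
#   ldd /usr/local/bin/curl | missing_libs
```

No output means every library the binary asks for was found.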

Conclusion
We show how to compile the latest version of openssl and curl on an older Redhat 6.x OS. The motivation for doing so was to remain compatible with web sites which have already dropped, or soon will drop, their support for TLS 1.0. The compiled curl and openssl support TLS 1.2, which should keep them useful for a long while.

References and related
I closely followed the instructions in this stackoverflow post: https://stackoverflow.com/questions/44270707/cant-build-latest-libcurl-on-rhel-7-3#44297265
openssl source: https://www.openssl.org/source/
curl sources: https://curl.haxx.se/download.html
Here’s a web site that only supports TLS 1.2 which shows the problem: https://www.askapache.com/. You can see for yourself on ssllabs.com

Categories
Admin Apache Security SLES Web Site Technologies

RSA Web Agent Installation: what might go wrong

Intro
As usual I ran into a few problems installing the RSA Web agent for a client. With this documentation I hope to jog my memory for my next installation or help someone else out who is experiencing the same problems.

The details
I was installing it on a SLES 11 system, Web Agent version 7.1.

So I ran the CD/install program as root and went through the prompts for the initial setup. I tried to launch firefox at the end, which it couldn’t, but I don’t think that is significant. I started up the web server. The error.log file began to fill up! It looked like this:

acestatus: error while loading shared libraries: libaceclnt.so: cannot open shared object file: No such file or directory
rpc_server 2389 started by 2379
RSALogoffCookieService: error while loading shared libraries: libaceclnt.so: cannot open shared object file: No such file or directory
AceShutdown try to kill process 2389
signal 15 received
acestatus: error while loading shared libraries: libaceclnt.so: cannot open shared object file: No such file or directory
RSALogoffCookieService: error while loading shared libraries: libaceclnt.so: cannot open shared object file: No such file or directory
start child 2403
[Mon Aug 18 16:17:55 2014] [notice] Apache/2.2.27 (Unix) mod_rsawebagent/7.1.0[639] DAV/2 PHP/5.2.14 with Suhosin-Patch configured -- resuming normal operations
Cannot register service: RPC: Authentication error; why = Client credential too weak
unable to register (300760, 1).child 2403 end
start child 2409
Cannot register service: RPC: Authentication error; why = Client credential too weak
unable to register (300760, 1).child 2409 end
start child 2410
Cannot register service: RPC: Authentication error; why = Client credential too weak
unable to register (300760, 1).child 2410 end
start child 2411
Cannot register service: RPC: Authentication error; why = Client credential too weak
unable to register (300760, 1).child 2411 end
start child 2412
Cannot register service: RPC: Authentication error; why = Client credential too weak
unable to register (300760, 1).child 2412 end
start child 2413
...

Not good.

So I eventually realize that my web server is running as user wwwrun and the RSA web agent stuff I installed as root and its directory, rsawebagent, is owned by userid 40959 – there was no attempt by the installer to match that up to the user the web server runs as. So I try a fix by hand like this:

$ chown -R wwwrun rsawebagent

Success! That succeeds in getting rid of the repeating RPC error. Now the error.log file has only a modest level of errors:

acestatus: error while loading shared libraries: libaceclnt.so: cannot open shared object file: No such file or directory
rpc_server 27766 started by 27756
RSALogoffCookieService: error while loading shared libraries: libaceclnt.so: cannot open shared object file: No such file or directory
AceShutdown try to kill process 27766
signal 15 received
acestatus: error while loading shared libraries: libaceclnt.so: cannot open shared object file: No such file or directory
RSALogoffCookieService: error while loading shared libraries: libaceclnt.so: cannot open shared object file: No such file or directory
start child 27780
[Mon Aug 18 16:25:00 2014] [notice] Apache/2.2.27 (Unix) mod_rsawebagent/7.1.0[639] DAV/2 PHP/5.2.14 with Suhosin-Patch configured -- resuming normal operations

But the thing is, it actually, mostly kind of, seems to work. You see a promising Authentication Succeeded screen in your browser after logging in to the home page. But then it directs you back to the RSA login screen. I was actually stuck on this point for a long time.

The error.log file also looks encouraging at this point:

[Mon Aug 18 16:27:28 2014] [notice] Authentication succeeded User: drj.

My insight today is to tackle the libaceclnt.so problem. I actually ran a trace of the startup to see where it was looking for that file so I could put it there. It was looking in system directories like these:

[pid 31974] open("/usr/lib64/tls/x86_64/libaceclnt.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 31974] stat("/usr/lib64/tls/x86_64", 0x7fff93b721b0) = -1 ENOENT (No such file or directory)
[pid 31974] open("/usr/lib64/tls/libaceclnt.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 31974] stat("/usr/lib64/tls", 0x7fff93b721b0) = -1 ENOENT (No such file or directory)
[pid 31974] open("/usr/lib64/x86_64/libaceclnt.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 31974] stat("/usr/lib64/x86_64", 0x7fff93b721b0) = -1 ENOENT (No such file or directory)
[pid 31974] open("/usr/lib64/libaceclnt.so", O_RDONLY) = -1 ENOENT (No such file or directory)
...
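
A trace like this gets long, so it helps to boil it down to just the paths where the library was looked for and not found. Here is a sketch of such a filter. The function name is hypothetical, and it assumes the trace uses open() calls as in the excerpt above (newer systems may show openat() instead).

```shell
# failed_lookups LIBNAME: reads strace output on stdin and prints each path
# where an open() of LIBNAME failed with ENOENT.
failed_lookups() {
    grep -F -e "$1" | grep 'ENOENT' | sed -n 's/.*open("\([^"]*\)".*/\1/p'
}

# Example:
#   strace -f -o /tmp/trace <your apache start command>
#   failed_lookups libaceclnt.so < /tmp/trace
```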

So I decided to make a soft link to it from /usr/lib64 such that:

 libaceclnt.so -> /usr/local/apache202/rsawebagent/libaceclnt.so

Note that my ServerRoot was /usr/local/apache202.
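
For the record, creating that soft link could look like the sketch below. I wrapped it in a small function (the name is made up) that refuses to overwrite anything already sitting at the target path, since /usr/lib64 is not a place to be careless.

```shell
# link_library SOURCE_LIB TARGET_DIR
# Creates TARGET_DIR/<basename of SOURCE_LIB> as a symlink to SOURCE_LIB,
# refusing to touch anything that already exists at that path.
link_library() {
    source_lib=$1
    link_path=$2/$(basename "$1")
    [ -f "$source_lib" ] || { echo "no such library: $source_lib" >&2; return 1; }
    if [ -e "$link_path" ] || [ -L "$link_path" ]; then
        echo "already exists: $link_path" >&2
        return 1
    fi
    ln -s "$source_lib" "$link_path"
}

# Example (as root):
#   link_library /usr/local/apache202/rsawebagent/libaceclnt.so /usr/lib64
```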

Now when I start up my apache202 instance I have this in error.log:

rpc_server 28874 started by 28860
grep RSALogoffCookieService /proc/*/cmdline | sed 's/\/cmdline.*\/proc\// /g' | sed 's/\/cmdline.*/ /'  | sed 's/.*\/proc\// /' | sort -u
start child 28877
grep RSALogoffCookieService /proc/*/cmdline | sed 's/\/cmdline.*\/proc\// /g' | sed 's/\/cmdline.*/ /'  | sed 's/.*\/proc\// /' | sort -u
AceShutdown try to kill process 28874
signal 15 received
grep RSALogoffCookieService /proc/*/cmdline | sed 's/\/cmdline.*\/proc\// /g' | sed 's/\/cmdline.*/ /'  | sed 's/.*\/proc\// /' | sort -u
start child 28913
[Mon Aug 18 16:36:23 2014] [notice] Apache/2.2.27 (Unix) mod_rsawebagent/7.1.0[639] DAV/2 PHP/5.2.14 with Suhosin-Patch configured -- resuming normal operations

And best of all – it actually works!

I get the RSA authentication page initially. I log on and get redirected to the actual server home page. The access.log file records my username in the access line.

Additional error observed months later
You know that symptom I described above? You see a promising Authentication Succeeded screen in your browser after logging in to the home page. But then it directs you back to the RSA login screen. My web server had been running fine for over a month when all of a sudden it behaved that way again. Confounding. So I put on my big boy pants and did an strace. Nothing popped out at me, but I was struck by frequent access to an htdocs filepath. What’s so unusual about that? I don’t use htdocs in my configurations! So where was that coming from? I re-checked my configuration. OK, this is embarrassing. I have a sweeping include statement in my top-level httpd.conf file:

# pick up all vhosts
Include conf/vhosts/*.conf

It seemed like a good idea at the time. In my conf/vhosts directory I actually had two conf files, my rsaauth.conf but also a dflt.conf!! And the dflt.conf had the references to htdocs, but no references to the RSA authentication. So it was being used to establish the location of the home directory and the other conf file to fix the authentication type, I guess.

I removed the dflt.conf file, restarted and everything began to work once again. Whew!
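
In hindsight a one-line grep would have found the culprit faster than strace. A sketch, with a function name of my own choosing, that lists which of the files matched by a wildcard Include mention a given string:

```shell
# confs_mentioning PATTERN VHOST_DIR
# Lists the *.conf files in VHOST_DIR (the ones a wildcard Include would
# pick up) that contain PATTERN, so stray files stand out.
confs_mentioning() {
    grep -l -e "$1" "$2"/*.conf 2>/dev/null
}

# Example:
#   confs_mentioning htdocs /usr/local/apache202/conf/vhosts
```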

RPC errors returned after a few months
After a year or so of running, the RPC errors mentioned above returned. I never could figure out why, and I no longer needed this service, so I didn’t pursue it.

Conclusion
A few errors were observed installing RSA Web Agent v 7.1 on SLES Linux. I had had similar problems on Redhat as well. I finally found some solutions and now the client is ready to use it!

References
This write-up is partially related to my blog post of installing multiple apache instances.

Categories
Admin Internet Mail SLES

The IT Detective Agency: emails began piling up this week, no obvious cause

Intro
Today I had my choice of problems I could highlight, but I like this one the best. Our mail server delivers email to a wide variety of recipients. All was going well and it ran pretty much unattended until this week when it didn’t go so well. Most emails were getting delivered, but more and more were starting to pile up in the queues. This is the story of how we unraveled the mystery.

The details
It’s best to work from examples I think. I noticed emails to me.com were being refused delivery as well as emails to rnbdesign.com. The latter is a smaller company so we heard from them the usual story that we’re the only ones who can’t send to them.

So I forced delivery with verbose logging. I’m running sendmail, so that looks like this:

> sendmail -qRrnbdesign.com -Cconfig_file -v

That didn’t work out, producing a no route to host type of error. I did a DNS lookup by hand. That showed one set of results, while sendmail was connecting to an entirely different IP address. How could that be??

I was at a loss so I do what I do when I’m desperate: strace. That looks like this:

> strace -f sendmail -qRrnbdesign.com -Cconfig_file -v > /tmp/strace 2>&1

That produced 12,000 lines of output. All the system calls that the process and any of its forked processes invoke. Is that too much to comb through by hand? No, not at all, not when you begin to see the patterns.

I pored over the trace, not knowing what most of it meant, but looking for especially any activity regarding networking and DNS. Around line 6,000 I found it. There was mention of nscd.
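
If you would rather not read all 12,000 lines, a filter along these lines narrows the trace down to the network- and resolver-related calls. It is a sketch, the function name is mine, and the list of patterns is just my guess at what is worth looking for.

```shell
# network_calls: reads strace output on stdin and prints, with line numbers,
# the lines that look network- or resolver-related.
network_calls() {
    grep -n -E 'connect\(|nscd|resolv\.conf|nsswitch\.conf'
}

# Example:
#   network_calls < /tmp/strace | less
```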

For the unaware the use of nscd (name service cache daemon) might seem innocent enough, or even good-intentioned. What could be wrong with caching frequently used DNS results? The only issue is that it doesn’t work right! As far as I know, nscd derives from UC Berkeley Unix code and has never been supported. I didn’t even like it when I was running SunOS. It caches the DNS queries but ignores TTLs. This is fatal for mail servers or just about anything you can think of, especially on servers that are infrequently booted as mine are.
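
A way to catch this kind of stale answer without strace is to compare what the system’s name service returns with what DNS says right now. The sketch below does that for one host name, which for mail would be the MX host of the problem domain. It assumes getent and dig are installed and only looks at IPv4 addresses, and the function names are made up.

```shell
# compare_lookups HOSTNAME
# Prints the address from getent (which goes through the system's name
# service switch, and hence nscd when it is running) next to what dig
# gets straight from DNS. Succeeds if the getent answer is among them.
compare_lookups() {
    cached=$(getent hosts "$1" | awk '{ print $1; exit }')
    fresh=$(dig +short "$1" A | grep -E '^[0-9.]+$' | sort | tr '\n' ' ')
    echo "getent: ${cached:-none}"
    echo "dig:    ${fresh:-none}"
    same_answer "$cached" "$fresh"
}

# same_answer CACHED_IP "IP IP ..."
# Succeeds if CACHED_IP is non-empty and appears in the space-separated list.
same_answer() {
    case " $2 " in
        *" $1 "*) [ -n "$1" ] ;;
        *) return 1 ;;
    esac
}

# Example:
#   compare_lookups <mx-host-of-the-problem-domain> || echo "stale answer?"
```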

I stopped nscd right away:

> service nscd stop

and re-ran the sendmail queue runner (same command as above). The rnbdesign.com emails flowed out instantly! Soon hundreds of stuck emails were flushed out.

Of course for good measure nscd had to be removed from the startup sequence:

> chkconfig nscd off

An IT pro always keeps unsolved mysteries in his mind. This time I knew I also had in hand the solution to an earlier-documented mystery about email to paladinny.com.

Conclusion
nscd might show up in your SLES or OpenSuse server. I strongly suggest disabling it before you wind up with old DNS values and an extremely hard-to-debug issue.

Case closed!

Categories
Admin DNS Internet Mail SLES

Strange problem with email to paladinny.com

Intro
This is probably the most obscure of all postings I will ever do – it’s really just opening up my private journal to the Internet, which helps me when I need to recall how I fixed something.

So the story is that I’m having trouble sending email to anyone in the domain paladinny.com, and I just couldn’t figure out why.

The details
With my sendmail config I finally rolled up my sleeves, and did some debugging, even though I am pressed for time. Start up our sendmail debugging session:

> sendmail -Cconfig_file.cf -bt -d35.9

This produces a lot of blah, blah, configuration settings, blah, blah, and finally a sort of sendmail debugging shell. So let’s test a good “normal” domain:

> 3,0 test@gmail.com

canonify           input: test @ gmail . com
Canonify2          input: test < @ gmail . com >
Canonify2        returns: test < @ gmail . com . >
canonify         returns: test < @ gmail . com . >
parse              input: test < @ gmail . com . >
Parse0             input: test < @ gmail . com . >
Parse0           returns: test < @ gmail . com . >
ParseLocal         input: test < @ gmail . com . >
ParseLocal       returns: test < @ gmail . com . >
Parse1             input: test < @ gmail . com . >
Mailertable        input: < gmail . com > test < @ gmail . com . >
Mailertable        input: gmail . < com > test < @ gmail . com . >
Mailertable      returns: test < @ gmail . com . >
Mailertable      returns: test < @ gmail . com . >
SmartTable         input: test < @ gmail . com . >
SmartTable       returns: test < @ gmail . com . >
MailerToTriple     input: < > test < @ gmail . com . >
MailerToTriple   returns: test < @ gmail . com . >
Parse1           returns: $# esmtp $@ gmail . com . $: test < @ gmail . com . >
parse            returns: $# esmtp $@ gmail . com . $: test < @ gmail . com . >

and then this problem domain:

> 3,0 test@paladinny.com

canonify           input: test @ paladinny . com
Canonify2          input: test < @ paladinny . com >
Canonify2        returns: test < @ paladinny . no-ip . biz . >
canonify         returns: test < @ paladinny . no-ip . biz . >
parse              input: test < @ paladinny . no-ip . biz . >
Parse0             input: test < @ paladinny . no-ip . biz . >
Parse0           returns: test < @ paladinny . no-ip . biz . >
ParseLocal         input: test < @ paladinny . no-ip . biz . >
ParseLocal       returns: test < @ paladinny . no-ip . biz . >
Parse1             input: test < @ paladinny . no-ip . biz . >
Mailertable        input: < paladinny . no-ip . biz > test < @ paladinny . no-ip . biz . >
Mailertable        input: paladinny . < no-ip . biz > test < @ paladinny . no-ip . biz . >
Mailertable        input: paladinny . no-ip . < biz > test < @ paladinny . no-ip . biz . >
Mailertable      returns: test < @ paladinny . no-ip . biz . >
Mailertable      returns: test < @ paladinny . no-ip . biz . >
Mailertable      returns: test < @ paladinny . no-ip . biz . >
SmartTable         input: test < @ paladinny . no-ip . biz . >
SmartTable       returns: test < @ paladinny . no-ip . biz . >
MailerToTriple     input: < > test < @ paladinny . no-ip . biz . >
MailerToTriple   returns: test < @ paladinny . no-ip . biz . >
Parse1           returns: $# esmtp $@ paladinny . no-ip . biz . $: test < @ paladinny . no-ip . biz . >
parse            returns: $# esmtp $@ paladinny . no-ip . biz . $: test < @ paladinny . no-ip . biz . >

I have to look more into what Canonify2 does. But this gives me an idea: force the mailertable to handle paladinny . no-ip . biz the way I want it to, namely:

paladinny.no-ip.biz relay:barracuda.cblconsulting.com

because in DNS my DNS server returns this funny result:

> dig mx paladinny.com

; <<>> DiG 9.6-ESV-R7-P3 <<>> mx paladinny.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17559
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 0
 
;; QUESTION SECTION:
;paladinny.com.                 IN      MX
 
;; ANSWER SECTION:
paladinny.com.          351     IN      CNAME   paladinny.no-ip.biz.
 
;; AUTHORITY SECTION:
no-ip.biz.              60      IN      SOA     nf1.no-ip.com. hostmaster.no-ip.com. 2052775595 600 300 604800 600
 
;; Query time: 30 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jan 18 08:53:49 2013
;; MSG SIZE  rcvd: 121

whereas Google’s public DNS says this, which looks like the intended result:

> dig mx paladinny.com @8.8.8.8

; <<>> DiG 9.6-ESV-R7-P3 <<>> mx paladinny.com @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3749
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
 
;; QUESTION SECTION:
;paladinny.com.                 IN      MX
 
;; ANSWER SECTION:
paladinny.com.          1800    IN      MX      10 barracuda.cblconsulting.com.
 
;; Query time: 236 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Fri Jan 18 08:55:42 2013
;; MSG SIZE  rcvd: 71

So at least we know where that odd paladinny.no-ip.biz comes from, sort of. It comes from my nameserver, but where it got that answer from I have no idea. It doesn’t come from the authoritative nameservers:

> dig mx paladinny.com @dns1.name-services.com.

; <<>> DiG 9.6-ESV-R7-P3 <<>> mx paladinny.com @dns1.name-services.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45704
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
 
;; QUESTION SECTION:
;paladinny.com.                 IN      MX
 
;; ANSWER SECTION:
paladinny.com.          1800    IN      MX      10 barracuda.cblconsulting.com.
 
;; Query time: 82 msec
;; SERVER: 98.124.192.1#53(98.124.192.1)
;; WHEN: Fri Jan 18 08:59:50 2013
;; MSG SIZE  rcvd: 71

A CNAME is not an MX record, so for my nameserver to return an answer (ANSWER: 1) when queried for the MX record, when all it thinks it has is a CNAME, seems to be an out-and-out error.
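The comparison between the two resolvers can be done mechanically. The following is a rough sketch, not a real DNS library: the helper names `parse_answer_line` and `mismatched` are my own, and the two input strings are simply the ANSWER lines from the dig output above.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    name: str
    ttl: int
    rclass: str
    rtype: str
    rdata: str


def parse_answer_line(line: str) -> Record:
    """Split one line of a dig ANSWER section into its five fields."""
    name, ttl, rclass, rtype, rdata = line.split(None, 4)
    return Record(name, int(ttl), rclass, rtype, rdata.strip())


def mismatched(records: List[Record], wanted_type: str) -> List[Record]:
    """Return the records whose type differs from the type that was queried."""
    return [r for r in records if r.rtype != wanted_type]


# ANSWER lines copied from the two dig runs: local nameserver vs. 8.8.8.8.
local = parse_answer_line(
    "paladinny.com.          351     IN      CNAME   paladinny.no-ip.biz.")
google = parse_answer_line(
    "paladinny.com.          1800    IN      MX      10 barracuda.cblconsulting.com.")

for label, rec in (("local", local), ("8.8.8.8", google)):
    verdict = "unexpected type" if mismatched([rec], "MX") else "ok"
    print(f"{label}: {rec.rtype} {rec.rdata} ({verdict})")
```

Run against the output of several resolvers, a check like this makes the odd one out obvious at a glance.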

And putting the resolved name in the mailertable is also not normal. Normally you put the domain itself, as in:

paladinny.com relay:barracuda.cblconsulting.com

and of course that’s the first thing I tried, but it has no effect whatsoever.

February Update and Conclusion
The mystery was solved when a whole bunch of email deliveries started failing on my system and I was forced to do some serious debugging. Long story short, my SLES system was regrettably running nscd, the name service cache daemon. I didn’t even bother to check paladinny.com. So many other things cleared up when I killed it that I’m sure it was the cause of the paladinny.com issue as well. This is all described in this post.

Categories
Linux Network Technologies SLES TCP/IP

Ethernet Bridging on the cheap. Fail. Then Success with OLTV

Intro
Some experiments just don’t work out. I became curious about a technology that has various names: ethernet bridging, wide-area VLANs, OTV, L2TP, etc. It looked like it could be done on the cheap, but that didn’t pan out for me. But later on we got hold of high-end gear that implements OTV and began to get it to work.

The details
What this is is the ability to extend a subnet to a remote location. How cool is that? This can be very useful for various reasons. A disaster recovery center, for instance, which uses the same IP addressing. A strategic decision to move some, but not all equipment on a particular LAN to another location, or just for the fun of it.

As with anything truly useful there are open source implementations. I found openvpn, but decided against it because it had an overall client/server description and so didn’t seem quite what I had in mind. Openvpn does have a page about creating an ethernet bridging setup which is quite helpful, but when you install the product it is all about the client/server paradigm, which is really not what I had in mind for my application.

Then I learned about Astaro RED at the Amazon Cloud conference I attended. That’s RED as in Remote Ethernet Device. That sounded pretty good, but it didn’t seem quite what we were after. It must have looked good to Sophos as well because as I was studying it, Sophos bought them! Astaro RED is more for extending an ethernet to remote branch offices.

More promising for cheapo experimentation, or so I thought at the time, is etherip.

Very long story short, I never got that to work out in my environment, which was SLES VM servers.

What seems to be the most promising solution, and the most expensive, is overlay transport virtualization (OLTV or simply OTV), offered by Cisco in their Nexus switches. I’ll amend this post when I get a chance to see if it worked or not!

December Update
OTV is beginning to work. It’s really cool seeing it for the first time. For instance, I have a server in South Carolina on an OTV subnet, IP 10.94.45.2. Its default gateway is in New Jersey! Its gateway is in the ARP table, as it has to be, but merely to PING the gateway produces this unusual time lag:

> ping 10.94.45.1

PING 10.94.45.1 (10.194.54.33) 56(84) bytes of data.
64 bytes from 10.94.45.1: icmp_seq=1 ttl=255 time=29.0 ms
64 bytes from 10.94.45.1: icmp_seq=2 ttl=255 time=29.1 ms
64 bytes from 10.94.45.1: icmp_seq=3 ttl=255 time=29.6 ms
64 bytes from 10.94.45.1: icmp_seq=4 ttl=255 time=29.1 ms
64 bytes from 10.94.45.1: icmp_seq=5 ttl=255 time=29.4 ms

See those response times? Huge. I ping the same gateway from a different LAN but same server room in New Jersey and get this more typical result:

# ping 10.94.45.1

Type escape sequence to abort.
Sending 5, 64-byte ICMP Echos to 10.94.45.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/0/1 ms
Number of duplicate packets received = 0

But we quickly stumbled upon a gotcha. Large packets were killing us. The thing is that it’s one thing to run OTV over dark fiber, which we know another customer is doing without issues; but to run it in an MPLS network is something else.

Before making any adjustment on our servers we found behaviour like the following:
– initial ssh to linux server works OK; but session soon freezes after a directory listing or executing other commands
– pings with the -s parameter set to anything greater than 1430 bytes failed – they didn’t get returned

So this issue is very closely related to a problem we observed on a regular segment where getvpn had just been implemented. That problem, which manifested itself as occasional IE errors, is described in some detail here.

Currently we don’t see our carrier being able to accommodate larger packets so we began to see what we could alter on our servers. On Checkpoint IPSO you can lower the MTU as follows:

> dbset interface:eth1c0:ipmtu 1430

The change happens immediately. But that’s not a good idea and we eventually abandoned that approach.

On SLES Linux I did it like this:

> ifconfig eth1 mtu 1430

In this platform, too, the change takes place right away.

By the way, we experimented and found that the largest MTU value we could use was 1430. At this point I’m not sure how to make this change permanent, but a little research should show how to do it.

After changing this setting, our ssh sessions worked great, though now we can’t send pings larger than 1402 bytes.
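The 1402 figure is just header arithmetic. Here is a small sketch of it, assuming plain IPv4 with no IP options (20-byte header), the 8-byte ICMP echo header, and a 20-byte TCP header without options; the function names are mine.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request/reply header
TCP_HEADER_BYTES = 20   # TCP header without options


def max_ping_payload(mtu: int) -> int:
    """Largest value for ping -s that still fits in one unfragmented packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def max_tcp_mss(mtu: int) -> int:
    """Largest TCP segment payload that fits in one packet of this MTU."""
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


print(max_ping_payload(1430))  # 1402, matching the limit seen after the change
print(max_tcp_mss(1430))       # 1390
```

The same arithmetic on a standard 1500-byte MTU gives a ping payload limit of 1472 bytes.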

The latest problem is that on our OTV segment we can ping only one device but not the other.

August 2013 update
Well, we are resourceful people so yes, we got it running. Once the dust settled OTV worked pretty well, with certain concessions. We had to be able to control the MTU on at least one side of the connection, which, fortunately, we always could. Load balancers, proxy servers, Linux servers: we ended up jiggering all of them to lower their MTU to 1420. For firewall management we ended up lowering the MTU on the centralized management station.

Firewalls needed further voodoo. After pushing policy, clamping needs to be turned back on and acceleration turned off, like this (for Checkpoint firewalls):

$ fw ctl set int fw_clamp_tcp_mss 1
$ fwaccel off

Conclusion
Having preserved IPs during a server move can be a great benefit and OTV permits it. But you’d better have a talented staff to overcome the hurdles that will accompany this advanced technology.

Categories
Admin Linux ntp SLES

The IT Detective Agency: ntp server shows the wrong time after patching

Intro
One of my ntp servers hadn’t been patched in a while so it was time. Other systems rely on it for time synchronization. The next morning after the patching I noticed that the ntp service wasn’t even running. I started it and went about my business. Checking back some minutes later, it had died again. What happened, and how to get it fixed? Read on to see how we diagnosed and solved this puzzler.

The details
I like to use ntpq -p to query my ntp server – it’s easy to type! So when I started it up the results looked like this:

> ntpq -p

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*LOCAL(0)        .LOCL.          10 l   59   64   17    0.000    0.000   0.001
 drjegw.drjn.com 192.5.41.209     2 u   56   64   17    0.335  3605497  26.136
 drjegw2.drjn.co 192.5.41.41      2 u   56   64   17   19.241  3605534  39.621
 drjeuro.drjn.eu 128.252.19.1     2 u   60   64   17  105.970  3605532  38.946

That’s some offset, eh? 3.605 x 10^6 msec, or, when you think about it, just over an hour. And yet the local clock had no offset. Strange.
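For anyone who doesn't want to do that conversion in their head, here is a tiny sketch. The function name `split_offset` is my own; the input is the offset ntpq reported for drjegw.drjn.com, in milliseconds.

```python
from typing import Tuple


def split_offset(offset_ms: float) -> Tuple[int, int, float]:
    """Convert an ntpq offset in milliseconds to (hours, minutes, seconds)."""
    total_seconds = offset_ms / 1000.0
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return int(hours), int(minutes), seconds


hours, minutes, seconds = split_offset(3605497)
print(f"{hours} h {minutes} min {seconds:.1f} s")  # 1 h 0 min 5.5 s
```

So the offset is one hour plus about five and a half seconds.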

Date
I like to do a crude check of system time by running the date command – quickly – on two different systems. Lacking some sleep, I noticed eventually, but not right away, that my ntpd server had a date that was behind by almost exactly an hour. I didn’t notice it at first because I had trained myself to only look at the seconds, which were “only” off by five seconds.

I checked to see if the timezone or localization settings had been changed by the patching – they hadn’t. So I went ahead and advanced the system clock by an hour. Actually the yast GUI of SLES gave me the option to sync against a time server, so I chose my closest one and did that after I had stopped ntpd.

Next problem, please
That got the time in the ballpark. But ntpd still wasn’t behaving. It exhibited a strange behaviour I’ve never seen before – its offset kept increasing. I observed this behaviour over the course of several minutes:

> ntpq -p

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        .LOCL.          10 l    3   64  377    0.000    0.000   0.001
*drjegw.drjn.com 192.5.41.209     2 u  129  128  377    0.350  146.846  81.771
+drjegw2.drjn.co 192.5.41.41      2 u    1  128  377   20.211  183.047  97.286
+drjeuro.drjn.eu 128.252.19.1     2 u   72  128  377  104.931  161.696  79.561

> ntpq -p

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        .LOCL.          10 l    5   64  377    0.000    0.000   0.001
*drjegw.drjn.com 192.5.41.209     2 u    2  128  377    1.803  182.380  97.636
+drjegw2.drjn.co 192.5.41.41      2 u    3  128  377   20.211  183.047  97.286
+drjeuro.drjn.eu 128.252.19.1     2 u   74  128  377  104.931  161.696  79.561

> ntpq -p

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        .LOCL.          10 l   28   64  377    0.000    0.000   0.001
*drjegw.drjn.com 192.5.41.209     2 u   89  128  377    1.803  182.380  97.636
+drjegw2.drjn.co 192.5.41.41      2 u   90  128  377   20.211  183.047  97.286
+drjeuro.drjn.eu 128.252.19.1     2 u   32  128  377  104.667  197.864  96.296

> ntpq -p

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        .LOCL.          10 l    5   64  377    0.000    0.000   0.001
*drjegw.drjn.com 192.5.41.209     2 u    4  128  377    1.813  218.325 113.345
+drjegw2.drjn.co 128.118.25.5     2 u    5  128  377   19.667  219.077 113.157
+drjeuro.drjn.eu 128.252.19.1     2 u   75  128  377  104.667  197.864  96.296

Look at that offset column. See? It keeps going up, at about a rate of 40 msec every two minutes. It ain’t supposed to do that!
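That rate can be turned into a drift figure. This is a back-of-the-envelope sketch under one loud assumption: that the three distinct offsets shown for drjegw.drjn.com (146.846, 182.380 and 218.325 msec) were taken exactly one poll interval (128 s) apart. The function name `drift_ppm` is hypothetical.

```python
from typing import Sequence


def drift_ppm(offsets_ms: Sequence[float], poll_seconds: float) -> float:
    """Average growth of the offset between consecutive polls, in parts per million."""
    steps = [later - earlier for earlier, later in zip(offsets_ms, offsets_ms[1:])]
    avg_ms_per_poll = sum(steps) / len(steps)
    # msec -> sec, divide by the interval, then scale to parts per million.
    return avg_ms_per_poll / 1000.0 / poll_seconds * 1_000_000


# Offsets for drjegw.drjn.com from the ntpq runs above; the 128 s spacing is assumed.
rate = drift_ppm([146.846, 182.380, 218.325], poll_seconds=128)
print(f"{rate:.0f} ppm")
```

Under that assumption the offset grows by about 36 msec per poll, which works out to roughly 280 ppm, far more drift than a healthy clock shows.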

So a Unix pal of mine said he had encountered an issue in ntp and had commented out that local clock. I honestly had absolutely no idea what that LOCAL line did, but it had never hurt before.

The local clock comes from these lines in ntp.conf:

server 127.127.1.0              # local clock (LCL)
fudge  127.127.1.0 stratum 10   # LCL is unsynchronized

So I took those out, stopped ntpd with a sudo service ntp stop, synced the time with a sudo sntp -P no -r drjegw.drjn.com, and restarted ntpd. It didn’t work immediately, but it became apparent eventually that it was working.

Meantime I discovered the ntpdc command, which is kind of informative in this situation:

> ntpdc
ntpdc> loopinfo

offset:               0.097373 s
frequency:            -132.558 ppm
poll adjust:          12
watchdog timer:       841 s

This tells me the offset is 97 msec (already too large in my experience) and that for some reason the system clock hadn’t been adjusted in 841 s, and that the clock drift rate was -132 ppm – much, much higher than any other system.

Then in a few minutes it clicked and got the offset in order:

ntpdc> loopinfo

offset:               0.000000 s
frequency:            -132.558 ppm
poll adjust:          4
watchdog timer:       11 s

So removing the local system clock seemed to be working. But what was the real cause of all this? I discussed it with an admin. Bear in mind that this is a physical server. He said the system clock gets its time from a hardware clock which should be visible in the ILO. We checked it. Sure enough, there it was, reporting in the ILO – still, after we had fixed the problem at the OS level – as one hour behind. There was no way to manually adjust it. The only option was to set up sntp servers, which we did, which forced the ILO to restart.

We logged back in to the ILO and voila, the time was right!

I now realize that in the OS the LOCAL Time was using that hardware clock, which must have drifted by an hour since the system was installed.

Before the patch the system was incrementally keeping up with the drift, making the necessary incremental changes periodically. But the discrepancy was too large for it after rebooting after the patching. In /var/log/ntp I even see a line:

14 Sep 07:20:53 ntpd[10259]: time correction of 3605 seconds exceeds sanity limit (1000); set clock manually to the correct UTC time.
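If you want to scan a log for this condition, something like the following sketch would do. The regular expression is written against the one log line quoted above, so treat the pattern as an assumption about the message format rather than a guarantee; `parse_sanity_line` is a made-up helper name.

```python
import re
from typing import Optional, Tuple

# Log line copied from /var/log/ntp as quoted above.
LINE = ("14 Sep 07:20:53 ntpd[10259]: time correction of 3605 seconds "
        "exceeds sanity limit (1000); set clock manually to the correct UTC time.")

PATTERN = re.compile(
    r"time correction of (-?\d+) seconds exceeds sanity limit \((\d+)\)")


def parse_sanity_line(line: str) -> Optional[Tuple[int, int]]:
    """Return (correction_seconds, limit_seconds), or None if the line doesn't match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


correction, limit = parse_sanity_line(LINE)
print(f"needed {correction} s, limit {limit} s, over by {correction - limit} s")
```

The one-hour-and-five-seconds discrepancy is well over the 1000-second limit, which fits with ntpd giving up instead of correcting the clock.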

Conclusion
Now the system is better and we have:

ntpdc> loopinfo

offset:               0.057029 s
frequency:            -5.465 ppm
poll adjust:          4
watchdog timer:       24 s

That’s better, but the offset of 57 msec is still far larger than normal. But it’s useable for now.
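To put the two frequency values from loopinfo in perspective, here is a small sketch that converts ppm into seconds per day. It assumes the frequency figure can be read as a steady rate error that would accumulate if nothing corrected it; the function name is mine.

```python
SECONDS_PER_DAY = 86_400


def drift_seconds_per_day(frequency_ppm: float) -> float:
    """Seconds gained or lost per day by a clock running off by frequency_ppm."""
    return frequency_ppm * 1e-6 * SECONDS_PER_DAY


for label, ppm in (("before", -132.558), ("after", -5.465)):
    print(f"{label}: {drift_seconds_per_day(ppm):+.2f} s/day")
```

That is about 11.5 seconds per day at -132.558 ppm versus under half a second per day at -5.465 ppm.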

Categories
Admin Linux SLES

The IT Detective Agency: A couple of our SLES servers are running very slowly

Intro
Sometimes truth is stranger than fiction, even in the IT realm. I actually had a mere supporting role in this case – credit must be given to my persistent and accomplished friends for finding the root cause.

This Unlikely case begins with a serendipitous accident
An incident accidentally gets assigned to me about a couple of application servers running slowly – the echo of command-line typing comes in fits and starts. Well, I quickly decide it’s not my issue and I look around to see who should handle it. Probably the network specialist, right? No other server is having this issue, and the servers with the issue are responding fine to PING on their local segment. But still, it sounds like a network problem. For instance an interface with duplex settings mismatched as compared to the switch port.

But I decide to be nice about it and approach the guy with the problem, “Freg,” and ask him what he thinks should be done with the ticket. He takes the opportunity to show me the problem in person. So I listen to his story and politely look. This took place on July 31st, mind you (yesterday). No one had used these servers until they tried today. They are too slow to use now. The last time they were used was about 50 days ago – they were fine then. There are two servers with a similar problem.

And it goes on like that. These are running an ERP app which starts a Java process. Both of these servers are VMs. So lots of facts are being thrown at me. Maybe some are relevant and some not. I have never heard of these servers and am not familiar with the app. No root access is available to us, but we can log on as the same user who runs the ERP app.

I do an uptime. It’s up 54 days. More importantly, the load average is very high – 25 and pinned there, because the 5- and 15-minute averages are also around 25. I now feel comfortable explaining the slow character-typing echo. So it’s not a network problem after all… Talk to a sysadmin is my advice. But he sits there expecting me to do more, to somehow offer something helpful…
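The "pinned" reasoning is simple enough to script. This sketch parses text in the /proc/loadavg layout and checks whether all three averages sit close together. The sample string and the 20% tolerance are illustrative assumptions, not values from the actual servers.

```python
from typing import Sequence, Tuple


def parse_loadavg(text: str) -> Tuple[float, float, float]:
    """Pull the 1-, 5- and 15-minute averages out of /proc/loadavg-style text."""
    one, five, fifteen = (float(field) for field in text.split()[:3])
    return one, five, fifteen


def is_pinned(loads: Sequence[float], tolerance: float = 0.2) -> bool:
    """True when all averages are within `tolerance` (as a fraction) of the highest."""
    low, high = min(loads), max(loads)
    return high > 0 and (high - low) / high <= tolerance


# Hypothetical sample resembling the situation described above.
sample = "25.10 25.02 24.95 27/412 26743"
print(is_pinned(parse_loadavg(sample)))
```

A short spike would show a high 1-minute value with much lower 5- and 15-minute values, and `is_pinned` would come back false.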

So I hem and haw and do essentially the one thing I know how to do in these circumstances: run strace. I get the process number of the offending process and try to get some insight into what it’s spending its time on. I don’t do this very often, and when I do I usually don’t learn very much, but it is another piece of the puzzle: I have learned more than when I started out. Let’s say the process number was 26743, then I ran:

> strace -f -p 26743

and simply watched the output. The output was weird. It was filled with calls to a function I’ve never seen before: futex. In fact it was all futex calls, looping, rapidly, and the same calls, producing timeout errors.

You can look in section 2 of the man pages:

> man -s2 futex

on a Linux system and see for yourself that futex is the fast user-space locking system call. Don’t ask me. I’d say it’s a kernel thing. But it intuitively doesn’t seem right that a program would be using excessive amounts of CPU doing nothing but this one system call. A healthier program will be seen making a nice variety of calls, usually file- and network-related ones like open, read, socket, connect, etc.
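Rather than eyeballing the trace, you can tally which system calls dominate. Below is a minimal sketch that counts call names in strace -f style lines. The sample lines are illustrative, not copied from the real trace, and the regular expression only handles the common `[pid N] name(...)` and `name(...)` shapes.

```python
import re
from collections import Counter
from typing import Iterable

# Optional "[pid N] " prefix (added by strace -f), then the syscall name and "(".
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?([a-z_][a-z0-9_]*)\(")


def tally_syscalls(lines: Iterable[str]) -> Counter:
    """Count how often each system call name starts a line of strace output."""
    counts: Counter = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:  # "<... resumed>" continuation lines and signals are skipped
            counts[match.group(1)] += 1
    return counts


# Illustrative lines in the style of strace -f output.
sample = [
    '[pid 26743] futex(0x7f3c9c0008c8, FUTEX_WAIT_PRIVATE, 1, {0, 1000000}) = -1 ETIMEDOUT (Connection timed out)',
    '[pid 26744] futex(0x7f3c9c0008c8, FUTEX_WAIT_PRIVATE, 1, {0, 1000000}) = -1 ETIMEDOUT (Connection timed out)',
    '[pid 26743] read(3, "", 4096) = 0',
]
for name, count in tally_syscalls(sample).most_common():
    print(name, count)
```

In a trace like the one described here, futex would account for essentially every line, which is the red flag.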

A popular cause ruled out
I also checked out /etc/resolv.conf. Often a sysadmin messes that file up with an invalid nameserver, or even, just the other day, a nameserver line that omitted the nameserver directive and only contained the IP address of the nameserver! The symptom of that is different. The initial login prompt comes slowly (as it times out doing a reverse lookup on your source IP address), but character echo after that is normal.
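That particular mistake, a bare IP address with no nameserver directive, is easy to catch with a script. Here is a minimal sketch; the function names and the sample file contents (using documentation addresses) are hypothetical, and it checks only for this one error.

```python
import ipaddress
from typing import List, Tuple


def is_ip(token: str) -> bool:
    """True if token parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(token)
        return True
    except ValueError:
        return False


def bare_ip_lines(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) for lines that are only an IP address."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line[0] in "#;":  # skip blanks and comments
            continue
        tokens = line.split()
        if len(tokens) == 1 and is_ip(tokens[0]):
            problems.append((number, line))
    return problems


# Hypothetical resolv.conf contents with the directive missing on line 3.
SAMPLE = """search example.com
nameserver 192.0.2.53
192.0.2.54
"""
print(bare_ip_lines(SAMPLE))  # [(3, '192.0.2.54')]
```

To check a real system you would pass in the contents of /etc/resolv.conf instead of `SAMPLE`.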

The leap second
Other ideas for isolating the problem: reboot, but turn off this process at start-up. I think the reboot did help. But our sysadmin found that in fact this is a known issue on Suse Linux (SLES).

From we learn that a leap second was added June 30th, 2012, and there is a problem associated with it. It “can cause applications that are using FUTEXes to consume 100% of CPU. The issue is present in all Linux kernel versions >= 2.6.22, therefor affecting SLES 11 SP1 and later releases.” Wow!

The remedy in that case is to execute the command

$ date -s "$(LC_ALL=C date)"

to trigger a call to the kernel’s clock_was_set() function. We did this and it seems to have fixed our issue.

Case closed.

Conclusion
The best sleuthing involves multiple people looking at things. Sometimes their individual breakthroughs need to be combined. Here the incidental observation of futex calls helped associate a CPU problem with a kernel bug related to a leap second implementation. This also explains why the problem did not exist 50 days ago – that was before June 30th.
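The timeline is easy to double-check with a little date arithmetic. This sketch assumes the year is 2012 (the year of the June 30th leap second) and takes July 31st as the day the slowness was shown to me; the function name is my own.

```python
from datetime import date, timedelta

LEAP_SECOND_DAY = date(2012, 6, 30)  # leap second added at the end of this day


def last_good_day(observed: date, days_ago: int) -> date:
    """Date on which the servers were last used without problems."""
    return observed - timedelta(days=days_ago)


observed = date(2012, 7, 31)  # assumption: the year is 2012
last_used = last_good_day(observed, 50)
print(last_used, last_used < LEAP_SECOND_DAY)
```

Fifty days before July 31st lands on June 11th, comfortably before the leap second, so the servers had not yet hit the bug the last time anyone used them.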