Categories
Admin Linux SLES

The IT Detective Agency: A couple of our SLES servers are running very slowly

Intro
Sometimes truth is stranger than fiction, even in the IT realm. I actually had a mere supporting role in this case – credit must be given to my persistent and accomplished friends for finding the root cause.

This unlikely case begins with a serendipitous accident
An incident accidentally gets assigned to me about a couple of application servers running slowly – the echo of command-line typing comes in fits and starts. Well, I quickly decide it’s not my issue and look around to see who should handle it. Probably the network specialist, right? No other server is having this issue, and the servers with the issue respond fine to PING on their local segment. But still, it sounds like a network problem – for instance, an interface whose duplex setting is mismatched with its switch port.

But I decide to be nice about it and approach the guy with the problem, “Freg,” and ask him what he thinks should be done with the ticket. He takes the opportunity to show me the problem in person. So I listen to his story and politely take a look. This took place on July 31st, mind you (yesterday). No one had been using these servers until they tried today, and now they are too slow to use. The last time they were used was about 50 days ago – they were fine then. There are two servers with a similar problem.

And it goes on like that. These are running an ERP app which starts a Java process. Both of these servers are VMs. So lots of facts are being thrown at me. Maybe some are relevant and some not. I have never heard of these servers and am not familiar with the app. No root access is available to us, but we can log on as the same user who runs the ERP app.

I do an uptime. It’s up 54 days. More importantly, the load average is very high – 25 and pinned there, because the 5- and 15-minute averages are also around 25. I now feel comfortable explaining why the character echo is so slow. So it’s not a network problem after all… Talk to a sysadmin is my advice. But he sits there expecting me to do more, to somehow offer something helpful…

So I hem and haw and do essentially the one thing I know how to do in these circumstances: run strace. I get the process ID of the offending process and try to get some insight into what it’s spending its time on. I don’t do this very often, and when I do I usually don’t learn very much, but it is another piece of the puzzle: I end up knowing more than when I started. Let’s say the process ID was 26743; then I ran:

> strace -f -p 26743

and simply watched the output. The output was weird. It was filled with calls to a function I had never seen before: futex. In fact it was nothing but futex calls – the same calls looping rapidly and producing timeout errors.

You can look in section 2 of the man pages:

> man -s2 futex

on a Linux system and see for yourself that futex is the fast userspace locking system call. Don’t ask me – I’d say it’s a kernel thing. But it intuitively doesn’t seem right that a program would burn excessive amounts of CPU doing nothing but this one system call. A healthier program will be seen making a nice variety of calls, usually including network- and file-related ones like open, read, socket, gethostbyname, etc.
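If you want to quantify that impression rather than just eyeball it, strace can tally calls or filter on a single one. A quick sketch, reusing the same hypothetical PID as above:

> strace -c -f -p 26743               # attach, let it run a while, then Ctrl-C for a per-syscall summary
> strace -f -e trace=futex -p 26743   # show only the futex calls

In our case a summary like that would have shown futex dominating both the call count and the error count (all those timeouts).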

A popular cause ruled out
I also checked out /etc/resolv.conf. It’s common for a sysadmin to mess that file up with an invalid nameserver – or even, as I saw just the other day, a nameserver line that omitted the nameserver directive and contained only the IP address of the nameserver! But the symptom of that is different: the initial login prompt comes up slowly (as the server times out doing a reverse lookup on your source IP address), while character echo after that is normal.
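For reference, that particular botch looks something like this (the IP address here is made up):

# broken – the keyword is missing, so the resolver ignores the line
10.1.2.3
# what it should say
nameserver 10.1.2.3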

The leap second
Other ideas for isolating the problem: reboot, but turn off this process at start-up. I think the reboot did help. But our sysadmin found that in fact this is a known issue on Suse Linux (SLES).

From the vendor’s advisory our sysadmin dug up, we learn that a leap second was added June 30th, 2012, and there is a problem associated with it. It “can cause applications that are using FUTEXes to consume 100% of CPU. The issue is present in all Linux kernel versions >= 2.6.22, therefor affecting SLES 11 SP1 and later releases.” Wow!

The remedy in that case is to execute the command

$ date -s "$(LC_ALL=C date)"

to trigger a call to the kernel’s clock_was_set() function. We did this and it seems to have fixed our issue.

Case closed.

Conclusion
The best sleuthing involves multiple people looking at things. Sometimes their individual breakthroughs need to be combined. Here the incidental observation of futex calls helped tie a CPU problem to a kernel bug in the leap second implementation. This also explains why the problem did not exist 50 days ago – that was before June 30th.

Categories
Admin Linux SLES

How to Get By Without unix2dos in SLES

Intro
As a Unix old-timer looking at the latest releases, I have observed only one tendency – ever-increasing numbers of commands, always additive – until now. A command I considered useful (well, basically any command I have ever used I consider useful) has gone AWOL in Suse Linux Enterprise Server (SLES for short): unix2dos.

Why You Need It
These days you need it more than ever. What with sftp being used in place of ftp, your text files come over from a SLES server to your PC byte-for-byte, preserving the Linux-style way of ending a line with a lone newline character, “\n”. Open that file on your PC in Notepad and you’ll see one long line, because Windows requires more to indicate a new line: Windows OS’s like Windows 7 expect a carriage return + newline, i.e., “\r\n”.

Who You Going to Call
I spoke with some experts, so I cannot take credit for finding this out personally. Long story short, things have evolved and there is a more sophisticated command available that does this sort of thing and much else besides: recode.

But I don’t think I’ll ever use recode for anything else so I decided to re-create a unix2dos command using recode in a tiny shell script:

#!/bin/sh
# poor man's unix2dos: convert Unix (LF) line endings to DOS (CR+LF) in place
# inspired by http://yourlinuxguy.com/?p=232 and the fact that they took away this useful command
# 3/6/12
# "$@" rather than $* so filenames containing spaces survive intact
recode latin1..ibmpc "$@"

You call it like this:

> unix2dos file

and it overwrites the file and converts it to the format Windows expects.

My other expert contact says I could find the old unix2dos in OpenSuse but I decided not to go that route.

Of course to convert in the other direction you have dos2unix, which for some reason wasn’t removed from the distro. Strange, huh?

How to See That It Worked
I use

> od -c file|more

to look at the ASCII characters in a text file. It also shows the newline and carriage return characters as \n and \r respectively. This is a good command to know because it is also a “safe” way to look at a binary file – by safe I mean it won’t try to print out 8-bit characters that can leave your terminal settings in a mess!
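To make that concrete, here is roughly what od -c shows for a made-up file containing the single line “hi there” after conversion – note the \r right before the \n:

> od -c file
0000000   h   i       t   h   e   r   e  \r  \n
0000012

Before conversion the same file would end in just \n.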

2017 update
I finally needed this utility again after five years. My script doesn’t work on CentOS – no recode there, whatever that was. However, the one-liner provided in the comments worked just fine for me.
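I won’t reproduce the comment here, but for the record an equivalent one-liner that needs nothing beyond awk looks like this (infile and outfile are placeholders):

> awk '{ printf "%s\r\n", $0 }' infile > outfile

It simply re-emits every line with a CR+LF ending, which for plain text files is all unix2dos ever did.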

Conclusion
We can rest easy and send text files back-and-forth between a PC and a SLES server with the help of this unix2dos script we developed.

Interestingly, RedHat’s RHEL has kept unix2dos in its distribution. Good for them. In Ubuntu Linux unix2dos also seems decidedly missing.

Categories
Admin Internet Mail Linux SLES

Building sendmail on SLES

Intro
My sendmail binary built for SLES 10 SP 3 was not working well at all on SLES 11 SP1. It became apparent that libraries were not compatible so it was time to re-compile. I’ve documented that journey here. There were a few pitfalls along the way so I felt it was worth a blog post should anyone else ever need to do this.

February 2013 update
And now I’ve repeated the journey on SLES 11 SP2 – and ran into new problems! I’ll put that story in the appendix below.

Why Build?
Why build sendmail when you can find a package for it? For security it’s a good idea to run the latest version – it’s easier to defend during an audit. So when I look via zypper, I see it proposes sendmail v 8.14.3:

# sudo zypper if sendmail
Refreshing service 'nu_novell_com'.
Loading repository data...
Reading installed packages...
 
Information for package sendmail:
 
Repository: SLES11-SP1-Pool
Name: sendmail
Version: 8.14.3-50.20.1
Arch: x86_64
Vendor: SUSE LINUX Products GmbH, Nuernberg, Germany
...

I go to sendmail.org and find that the latest version is actually 8.14.5. And that’s fairly typical. The distributed release is over a year old.
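As an aside, if you want to check what a given server is actually running, sendmail’s debug flag prints the version and compiled-in options without sending any mail:

# sendmail -d0.1 -bv root

The first lines of output show the version string and the options (NEWDB, STARTTLS and so on) it was built with.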

Where this can really matter is when a vulnerability comes out. If you can roll your own you can be the first on the block with that vulnerability fixed – not putting yourself at the mercy of a vendor busy with hundreds of other distractions. And I have seen this phenomenon in action.

Distributing Your Build
I’m mixing up the order here.

Once you have your sendmail built, what’s the minimum set of things you need to put it on other servers?

For one, you gotta have a database package to go with sendmail. For historical reasons I use Sleepycat’s db (better known as Berkeley DB). Only Sleepycat was gobbled up by Oracle, and I don’t have a feel for what the future holds, though I fear decline. I grabbed the RPM from rpmfind.net:

db-4.7.25-1rt.x86_64.rpm.

This package has to go on each server where you will run sendmail if you are using db for your database. First issue in trying to install this RPM:

# sudo rpm -i db-4.7.25-1rt.x86_64.rpm
        file /usr/bin/db_archive from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_checkpoint from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_deadlock from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_dump from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_hotbackup from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_load from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_printlog from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_recover from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_stat from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_upgrade from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_verify from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64

I don’t really know if any of these conflicting files are used by the system. I also don’t know how to install “all but the conflicting files” in RPM. So we’ll try our luck and simply overwrite them:

# sudo rpm -i --force db-4.7.25-1rt.x86_64.rpm

Now no errors are reported.

We gotta make another kludge. sendmail, or at least the version I compiled, needs this db library in /usr/lib64, but the RPM puts it in /usr/lib. So…

# cd /usr/lib64; sudo ln -s ../lib/libdb-4.7.so

By the way, how did I decide that db has to be brought over? I fired up sendmail and got this error:

# sendmail
sendmail: error while loading shared libraries: libdb-4.7.so: cannot open shared object file: Error 40

Now with the sym link made I get a more pleasant application error:

# sendmail
can not chdir(/var/spool/clientmqueue/): Permission denied
Program mode requires special privileges, e.g., root or TrustedUser.

Continuing my out-of-order documentation(!), I create an init script in /etc/init.d and make sure sendmail is going to be started at boot:

# sudo chkconfig -s drjohnssendmail 35
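I won’t go through my whole init script, but a minimal LSB-style skeleton along these lines does the job – the script name, binary path and queue interval are just my conventions, so treat it as a sketch:

#!/bin/sh
### BEGIN INIT INFO
# Provides:          drjohnssendmail
# Required-Start:    $network $syslog
# Required-Stop:     $network $syslog
# Default-Start:     3 5
# Default-Stop:      0 1 2 6
# Description:       start/stop the locally compiled sendmail
### END INIT INFO

SM=/opt/mail/bin/sendmail       # confMBINDIR from the site.config.m4 shown later

case "$1" in
  start)
    echo "Starting sendmail"
    $SM -bd -q30m               # daemon mode, run the queue every 30 minutes
    ;;
  stop)
    echo "Stopping sendmail"
    pkill -f "$SM -bd"
    ;;
  restart)
    $0 stop; sleep 2; $0 start
    ;;
  *)
    echo "Usage: $0 {start|stop|restart}"
    exit 1
    ;;
esac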

I like to put the logs in /maillog:

# sudo rm /var/log/mail; sudo mkdir /var/log/mail; sudo ln -s /var/log/mail /maillog

I like to have the logging customized a bit, so I modify syslog-ng.conf somewhat:

# DrJ attempt to define filter based on match of sm-mta
filter f_sm-mta     { match("sm-mta"); };
filter f_fs-mta     { match("fs-mta"); };

and

...
#destination mail { file("/var/log/mail"); };
#log { source(src); filter(f_mail); destination(mail); };
 
destination mailwarn { file("/var/log/mail/mail.warn" perm(0644)); };
log { source(src); filter(f_mailwarn); destination(mailwarn); };
 
destination mailerr  { file("/var/log/mail/mail.err" perm(0644) fsync(yes)); };
log { source(src); filter(f_mailerr);  destination(mailerr); };
 
#
# and also all in one file:
#
destination mail { file("/var/log/mail/stat.log" perm(0644)); };
log { source(src); filter(f_sm-mta); destination(mail); };
log { source(src); filter(f_fs-mta); destination(mail); };

Followed by a

# sudo service syslog restart

Going Back to Compiling
So how did we compile sendmail in the first place, which was supposed to be the subject of this blog?

We downloaded the latest version from sendmail.org and unpacked it. Then we read the INSTALL file in the sendmail-8.14.5 directory for general guidance about the steps.

To make our compilation configuration portable, we try to encapsulate our idiosyncrasies in one file, devtools/Site/site.config.m4, which we create. Mine looks as follows:

dnl  DrJ config file for corporate mail delivery
dnl I am leaving out the ldap stuff because we stopped using it
dnl 
dnl which maps we will support - NEWDB is automatic if it finds the db libs
dnl in Linux NDBM is really GDBM, which isn't supported.  NEWDB support is not automatic
define(`confMAPDEF',`-DNEWDB -DMAP_REGEX')
dnl Berkeley DB is here...
dnl this doesn't work, exactly - put the db libs directly into /usr/lib
APPENDDEF(`confLIBDIRS',`-L/usr/local/ssl/lib -L/usr/lib64')
dnl libdb-4 looks like the sleepycat library on Linux
APPENDDEF(`confLIBS',`-ldb-4')
APPENDDEF(`confINCDIRS',`-I/opt/local/include -I/usr/local/ssl/include')
dnl where to put smrsh and mail.local programs
define(`confEBINDIR', `/opt/mail/bin')
dnl where to install include files
define(`confINCLUDEDIR', `/opt/mail/include')
dnl where to install library files
define(`confLIBDIR', `/opt/mail/lib')
dnl man pages
define(`confMANROOT', `/opt/local/man/cat')
dnl unformatted man pages
define(`confMANROOTMAN', `/opt/local/man/man')
dnl the sendmail binary goes into MBINDIR
define(`confMBINDIR', `/opt/mail/bin')
dnl programs only executed by root go to sbin
define(`confSBINDIR', `/opt/mail/sbin')
dnl shared library directory
define(`confSHAREDLIBDIR', `/opt/mail/lib')
dnl user-executable pgms, newaliases, mailq, hoststat, etc
define(`confUBINDIR', `/opt/mail/bin')
dnl TLS support 
APPENDDEF(`conf_sendmail_ENVDEF', `-DSTARTTLS')
APPENDDEF(`conf_sendmail_LIBS', `-lssl -lcrypto')

I’ll explain a bit of this file since of course it is critical to what we are trying to do. I arrived at its current state through a little experimentation, so I don’t know the full explanation of some of the settings. What it’s saying is that we use NEWDB – backed by the db package we spoke of earlier – for our maps, that I like to install the binaries into /opt/mail/bin, and that we want the option to run TLS.

With that set we run the compile:

# cd sendmail; sh ./Build

and it spits out some errors to the screen which indicate we’re missing some SSL headers. We get them with:

# sudo zypper source-install openssl

Now with any luck it will fully compile.

At this point it helps to create some of the target directories:

# sudo mkdir -p /opt/mail/{bin,sbin} /opt/local/man/cat{1,5,8}

And we create a user account, smmsp, with uid and gid of 225, and a group with the same name.
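In case it helps, the commands would be something like this (the home directory and shell are assumptions, matching what sendmail typically expects for its submission user):

# sudo groupadd -g 225 smmsp
# sudo useradd -u 225 -g smmsp -d /var/spool/clientmqueue -s /bin/false -c "sendmail submission" smmsp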

And then we can install it:

# sudo sh ./Build install

The install should work, mostly. But makemap doesn’t get put in /opt/mail for some reason. So you have to copy it by hand from sendmail-8.14.5/obj.Linux.2.6.32.12-0.7-default.x86_64/makemap to /opt/mail/sbin, for instance. You really need to have a makemap.
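In other words, something like this (the obj directory name encodes the kernel version and architecture, so yours will differ):

# sudo cp sendmail-8.14.5/obj.Linux.2.6.32.12-0.7-default.x86_64/makemap/makemap /opt/mail/sbin/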

Finally, I suggest recursively copying the cf directory:

# cp -r sendmail-8.14.5/cf /opt/mail

This way you have a pretty relocatable set of files under /opt/mail.

Appendix: Building on SLES 11 SP2
I thought this would be a breeze. Just look at my own blog posting above! Not so fast – that approach didn’t work at all.

You have to appreciate that under SLES 11 SP1 we needed a few key packages that aren’t very common:

– libopenssl-devel-0.9.8h-30.27.11
– zlib-devel-1.2.3-106.34

We pulled them from the SDK DVD. Well, turns out there is no SDK DVD for SLES 11 SP2! Novell, probably in one of those beloved cost-saving measures, no longer releases an SDK DVD. What to do?

Well, I found what not to do. I began copying key files from my SLES 11 SP1 installation, like libcrypto.a, libssl.a, /usr/include/ssl. This all helped to reduce the number of errors. But at the end of the day there was an error I couldn’t chase away no matter what:

/usr/src/packages/BUILD/openssl-0.9.8h/crypto/comp/c_zlib.c:235: undefined reference to `inflate'

Meanwhile the resourceful sysadmin found those development packages. He told me about SUSE Studio, which allows you to build your own distribution. He looked for those development packages in the distribution, found and installed them:

$ rpm -qa|grep devel

libopenssl-devel-0.9.8j-0.26.1
zlib-devel-1.2.3-106.34
zlib-devel-32bit-1.2.3-106.34

Then my build went through fine. Whew!

Conclusion
I could go on and on about details in the setup, but the scope here is the compilation, and we’ve covered that pretty well.

Once you get over the pain of compilation setup, sendmail runs great and is a great MTA.

Categories
Admin IT Operational Excellence Linux SLES

The IT Detective Agency: Cognos stopped working

Intro
Here’s another in our continuing exciting IT drama. A user reports that her Cognos app stopped working. She’s in charge of the Cognos application servers; I run the Cognos gateway on a Linux server. I have almost no working knowledge of Cognos. I learned just enough to get the gateway installed and configured on Linux, specifically SLES. Cognos is used for business intelligence reports and is now owned by IBM.

The Details
The home page comes up just fine, so I know the web server – Apache, of course – is working. I know I haven’t changed anything on the gateway, and she says she hasn’t changed anything on the dispatcher. So she asks me to save the config. It’s an X application. I run cogconfig.sh, which by the way is in COGNOS-INSTALL_DIR/bin64, not COGNOS-INSTALL_DIR/bin, contrary to the documentation for Linux. I cannot save the config. She asks me to export it. I can’t do that either! I get the error

CAM-CRP-1057 unable to generate the machine specific symmetric key.

She asks me to delete the keypairs. These live in the directories COGNOS-INSTALL_DIR/configuration/{signkeypair,encryptkeypair}, so I clear those out. Still I cannot save or export the configuration. I quickly switch to a Solaris server which we had hoped to retire, in order to have a working gateway while we mulled the problem over.
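If you ever have to do the same, it’s safer to move the key directories aside than to delete them outright – a sketch, with COGNOS-INSTALL_DIR standing for wherever the gateway is installed:

$ cd COGNOS-INSTALL_DIR/configuration
$ mkdir keypair.bak
$ mv signkeypair encryptkeypair keypair.bak/

Cognos should regenerate the keys the next time the configuration is saved successfully.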

Over the next days I checked to see if Java had changed – getting a working JRE was a little tricky on SLES. Nothing had changed. After the system admin came back from vacation the next week, I asked if by chance he had changed anything on the server. The last log showed he had been logged in at the time. He admitted to changing one thing.

He had changed the system name. This system has multiple interfaces and a unique hostname for each interface, and the /etc/hosts file includes entries for each of the interface IPs. Seeing there were no other changes, I concluded that this little innocent act was enough to kill the communication. Note that he did not change any of the routing, however. When you’re dealing with encryption, the system name can be significant: when those keys were initially generated they were tied to that name and would only work with the original hostname. At least that is my reverse engineering of the matter. Cognos is a pretty closed system, so it’s hard to pin down more precisely what is going on.
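So a sanity check worth doing before chasing anything fancier is to compare the current hostname against whatever was in effect when the keys were generated. On SLES 11 that’s roughly:

$ hostname -f          # what the system thinks it is right now
$ cat /etc/HOSTNAME    # where SLES persists the name across reboots

If either differs from the name in use when Cognos was first configured, suspect the keys.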

Conclusion
The hostname was changed back to the original name. Sure enough, now I can export the config and most importantly, save it without any errors.

Case closed!

Lessons Learned
Well, avoiding finger-pointing and quick judgements was helpful in this case. Of course I suspected she actually had done something to the dispatcher, but I behaved as though the problem might be on my side. We treated each other professionally while the system was down and we had no clue why. That was very helpful.

Categories
Admin IT Operational Excellence Linux SLES

The IT Detective Agency: the case of the messages from mars

Intro
Today we got a “funny” message on our SLES 11 server in the /var/log/warn file. You might think that Martians have landed!

The Details
Specifically this:

Nov 9 10:54:19 drjohn24 kernel: [72397.088297] martian source 10.120.2.24 from 10.0.0.3, on dev eth1
Nov 9 10:54:19 drjohn24 kernel: [72397.088300] ll header: 78:e7:d1:7b:25:32:00:a0:8e:a8:8e:b3:08:00

Every time I pinged 10.120.2.24 (drjohn24) from 10.0.0.3 it would produce those two lines in the warn and messages files. More worrisome, I could not ssh from one host to the other, though I could ssh to drjohn24 from a host on its local network. We observed this behaviour even with the firewall disabled. Strange, right?

One more thing to note: drjohn24 has two network interfaces and various routes defined.

The Solution
It didn’t take too long to get to the bottom of this. We set up the routes wrong. We meant to create a default route out of eth0, which was right, and a net-10 route for eth1, which we specified incorrectly. Do

netstat -rn

to show all routes. I had this:

Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
10.120.2.0      0.0.0.0         255.255.255.128 U         0 0          0 eth1
10.120.3.0      0.0.0.0         255.255.255.0   U         0 0          0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth0
10.0.0.0        10.120.2.1      255.255.255.128 UG        0 0          0 eth1
128.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 lo
0.0.0.0         10.120.3.1      0.0.0.0         UG        0 0          0 eth0

Do you see the error? We put the same mask on the 10.0.0.0 route as on eth1’s own subnet, and that’s not what we wanted.

The corrected version looks like this:

Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
...
10.0.0.0        10.120.2.1      255.0.0.0       UG        0 0          0 eth1
...
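To actually make that change on a running box you could use the ip command, roughly like this (gateway and device taken from the table above; on SLES you would also want to fix the persistent entry in /etc/sysconfig/network/routes so it survives a reboot):

# sudo ip route del 10.0.0.0/25 via 10.120.2.1 dev eth1
# sudo ip route add 10.0.0.0/8 via 10.120.2.1 dev eth1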

Conclusion
So what was happening is that the inbound packet from 10.0.0.3 was arriving at eth0 as we intended. But SLES 11 is now clever enough to realize, based on its routing table, that that is not the expected interface where a packet with that source IP should arrive. It should have arrived at eth1 because of the default route. No other static route was more specific for 10.0.0.3 due to our error. And apparently even with firewall turned off, SLES gets very defensive at this point. I’m not sure if it was sending return packets out of eth1 or not, because I kept looking for them out of eth0!

Once we corrected the routes the inbound packet arrived at eth0 and was returned with an answer packet from eth0 and the martian messages went away.

The martian message thing is a little obscure, and at the time it was more a distraction than anything else since we had to research what it meant. I guess in the future we’ll know instantly. It’s very similar to defining network topology on your firewalls as an anti-spoofing defense.
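For reference, the behaviour is governed by a couple of kernel parameters – reverse-path filtering and the logging of the offending packets – which you can inspect with sysctl:

# sysctl net.ipv4.conf.all.rp_filter       # 1 = drop packets arriving on an "unexpected" interface
# sysctl net.ipv4.conf.all.log_martians    # 1 = log them as martian sources, as above

Fixing the routing table, as we did, is of course a better answer than relaxing these.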

Case closed!