Sometimes truth is stranger than fiction, even in the IT realm. I actually had a mere supporting role in this case – credit must be given to my persistent and accomplished friends for finding the root cause.
This Unlikely case begins with a serendipitous accident
An incident accidentally gets assigned to me about a couple of application servers running slowly – the echo of command-line typing comes in fits and starts. Well, I quickly decide it’s not my issue and I look around to see who should handle it. Probably the network specialist, right? No other server is having this issue, and the servers with the issue are responding fine to PING on their local segment. But still, it sounds like a network problem. For instance an interface with duplex settings mismatched as compared to the switch port.
But I decide to be nice about it and approach the guy with the problem, “Freg,” and ask him what he thinks should be done with the ticket. He takes the opportunity to show me the problem in person. So I listen to his story and politely look. This took place on July 31st, mind you (yesterday). No one’s using theses servers until they tried today. They are too slow to use now. The last time they were used was about 50 days ago – they were fine then. There are two servers with a similar problem.
And it goes on like that. These are running an ERP app which starts a Java process. Both of these servers are VMs. So lots of facts are being thrown at me. Maybe some are relevant and some not. I have never heard of these servers and am not familiar with the app. No root access is available to us, but we can log on as the same user who runs the ERP app.
I do an uptime. It’s up 54 days. More imporantly, the load average is very high – 25 and pinned there because the 5- and 15-minute average is also around 25. I now feel comfortable explaining why the slow character-typing echo. So it’s not a network problem after all… Talk to a sysadmin is my advice. But he sits there expecting me to do more, to somehow offer something helpful…
So I hem and haw and do essentially the one thing I know how to do in these circumstances: run strace. I get the process number of the offending process and try to get some insight into what it’s spending its time on. I don’t do this very often, and when I do I usually don’t learn very much, but it is another piece of the puzzle: I have learned more than when I started out. Let’s say the process number was 26743, then I ran:
> strace -f -p 26743
and simply watched the output. The output was weird. It was filled with calls to a function I’ve never seen before: futex. In fact it was all futex calls, looping, rapidly, and the same calls, producing timeout errors.
You can look in section 2 of the man pages:
> man -s2 futex
on a Linux system and see for yourself that futex is the Fast userspace locking system call. Don’t ask me. I’d say it’s a kernel thing. But it intuitively doesn’t seem right that a program would be using excessive amounts of CPU doing nothing but this one system call. A more healthy program will be seen making a nice variety of calls, especially and usually TCP-related ones like open, read, socket, gethostbyname, etc.
A popular cause ruled out
I also checked out /etc/resolv.conf. It’s often that a sysadmin messes that file up with an invalid nameserver, or even, just the other day, a nameserver line that omitted the nameserver directive and only contained the IP address of the nameserver! The symptom of that is different. The initial login prompt comes slowly (as it times out doing a reverse lookup on your source IP address), but character echo after that is normal.
The leap second
Other ideas for isolating the problem: reboot, but turn off this process at start-up. I think the reboot did help. But our sysadmin found that in fact this is a known issue on Suse Linux (SLES).
From we learn that a leap second was added June 30th, 2012, and there is a problem associated with it. It “can cause applications that are using FUTEXes to consume 100% of CPU. The issue is present in all Linux kernel versions >= 2.6.22, therefor affecting SLES 11 SP1 and later releases.” Wow!
The remedy in that case is to execute the command
$ date -s “$(LC_ALL=C date)”
to trigger a clock_was_set() system call. We did this and it seems to have fixed our issue.
The best sleuthing involves multiple people looking at things. Sometimes their individual breakthroughs need to be combined. Here the incidental observation of futex calls helped associate a cpu problem to a kernel bug related to a leap second implementation. This also explains why the problem did not exist 50 days ago – that was before June 30th.