Intro
I got a text message from my monitoring system last night. I ignored it because I knew someone was working on the firewall at the time. I’ve learned enough about human nature to know that it is easy to ignore the first alert, so I’ve programmed HP SiteScope to send additional alerts after four hours of continuous errors. When I got the second one at 9 PM, I sprang into action, knowing this was no false alarm!
The details
Thanks to a bank of still-green monitors I could pretty quickly rule out what wasn’t the matter. Other equipment on that subnet was fine, so the firewall/switch/router was not the issue. Then what the heck was it? And how badly was it impacting mail delivery?
This particular server has two network interfaces. Though one interface was clearly unresponsive to SMTP, ping or any other protocol, I hadn’t yet investigated the other interface, which was more Internet-facing. I managed to find another Linux server on the outer network and tried to ping the outer interface. Yup, that worked. I tried a login. It took a long time to get through the ssh login, but then I got on and the server looked quite normal. I did a quick ifconfig – the inner interface was listed, had the right IP and looked completely normal. I tried some pings from it to its gateway and to other devices on the inner network. Nothing doing. No pings were returned.
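For reference, the checks from the server side looked roughly like this (interface names and addresses here are made up for illustration):

# the inner interface still reported a normal configuration
ifconfig eth0
# but nothing on the inner subnet answered from it
ping -c 3 10.0.0.1      # inner gateway
ping -c 3 10.0.0.25     # another device on the inner network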
I happened to have access to the switch. I thought maybe someone had pulled out its cable, so I even checked the switch port. It showed connected at 1000 Mbit/s, exactly like the other interface. So it was just too improbable that someone had pulled out the cable and happened to plug a cable from another server into that same switch port. Not impossible, just highly improbable.
Then I did what all sysadmins do when encountering a funny error – I checked the messages file, /var/log/messages. At first I didn’t notice anything amiss, but upon closer inspection there was one line that was out of place among the usual entries:
Nov 8 16:49:42 drjmailgw kernel: [3018172.820223] do_IRQ: 1.221 No irq handler for vector (irq -1)
Buried amidst the usual biddings of cron was a kernel message with an IRQ complaint. What the? I haven’t worried about IRQs since loading Slackware from diskettes onto my PC in 1994! Could it be? I have multiple ways to tell when the interface died – SiteScope monitoring, even the mail log itself (surely its log would look very different pre- and post-problem). Yup.
That mysterious IRQ error coincided with the moment communication through that interface stopped working. Oh, and for the record, it’s SLES 11 SP1 running on HP server-class hardware.
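Lining up the two timelines is a quick grep exercise, something like this (the mail log path is an assumption – it’s /var/log/mail on SLES):

# when did the kernel first complain about the IRQ?
grep 'No irq handler for vector' /var/log/messages
# and when did the last successful deliveries go through?
grep 'stat=Sent' /var/log/mail | tail -20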
What about my mail delivery? In a panic, realizing that sendmail would be happy as a clam in spite of such an error, I shut down its service. I was afraid email could be piling up on this server for hours, and I pride myself on delivering a faultless mail service that delivers in seconds, so that would be a big blow. With sendmail shut down I knew the backup server would handle all the mail seamlessly.
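On SLES 11 that shutdown is just the init script (assuming sendmail is installed as the standard package):

# stop the sendmail service so the backup server takes over
/etc/init.d/sendmail stop     # or equivalently: rcsendmail stop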
This morning, in the comfort of my office, I pursued the answer to that question: what was happening to my mail stream during this time? I knew outbound was not an issue (actually the act of writing this down made me want to confirm it! I don’t like to have falsehoods in writing. Correct – I’ve now checked, and outbound was working.) But it was inbound that really worried me. Sendmail was listening on that interface, after all, so I couldn’t think of anything obvious that would have prevented inbound mail from being readily accepted and then just sat on.
But such was not the case! True, the sendmail listener was available and listening on that external interface, but I dropped a hint above. Remember that my ssh login took a long while? That is classic behaviour when a server can’t communicate with its nameservers. It tries to do a reverse lookup on the ssh client’s source IP address. It tried the first nameserver, but it couldn’t reach it because it was on the internal network. Then it tried its next nameserver – also a no-go for the same reason. I’ve seen the problem so often that I wasn’t even worried when the login took a long time – a minute or so. I knew to wait it out and that I would get in.
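If you want to confirm that reverse DNS is the culprit, or sidestep the delay on future logins, something along these lines works (the client IP and nameserver address are placeholders):

# does the reverse lookup of the ssh client hang against the internal nameserver?
dig -x 203.0.113.45 @10.0.0.53
# to make sshd skip the reverse lookup entirely, set this in /etc/ssh/sshd_config:
#   UseDNS no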
But in sendmail I had figured that certain communications should never take a long time, so a long time ago I had lowered some of the default timeouts. My mc file on the upstream server contains these lines:
...
dnl Do not use RFC1413 identd. p 762  It requires another hole in the F/W
define(`confTO_IDENT',`0s')dnl
dnl Set more reasonable timeouts for SMTP commands
define(`confTO_INITIAL',`1m')dnl
define(`confTO_COMMAND',`5m')dnl
define(`confTO_HELO',`1m')dnl
define(`confTO_MAIL',`1m')dnl
define(`confTO_RCPT',`1m')dnl
define(`confTO_RSET',`1m')dnl
define(`confTO_MISC',`1m')dnl
define(`confTO_QUIT',`1m')dnl
...
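For changes like these to take effect you regenerate sendmail.cf from the mc file and restart sendmail, roughly like this (the cf.m4 location varies by distribution, so treat the path as an example):

# rebuild sendmail.cf from the mc file
m4 /usr/share/sendmail-cf/m4/cf.m4 sendmail.mc > sendmail.cf
# then restart sendmail so the new timeouts are read
/etc/init.d/sendmail restart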
I now think that in particular the HELO timeout (TO_HELO) of 60 seconds saved me! The upstream server reported in its mail log:
Timeout waiting for input from drjmailgw during client greeting
So it waited a minute while drjmailgw tried, unsuccessfully, to do a reverse lookup on the upstream server’s IP before responding, then gave up and went on to the secondary server as per the MX record in the mailertable. Whew!
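For context, the mailertable entry on the upstream server would look conceptually like this (domain and host names are placeholders). Because the target host is not wrapped in square brackets, sendmail honours its MX records and tries the secondary MX when the primary doesn’t answer:

# hypothetical /etc/mail/mailertable entry on the upstream server
drj.example.com         esmtp:drj.example.com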
More on that IRQ error
Let’s go back to that IRQ error. I got schooled by someone who knows these things better than I do. He says the Intel chipset was limited insofar as there weren’t enough IRQs for all the devices people wanted to use, so engineers devised a way to share IRQs amongst multiple devices – sort of like virtual IPs on one physical network interface. Anyways, on this server he suspects that something is wrong with the multipath driver, which is loaded for the Fibre Channel host adapter card. In fact he noticed that the network interface had flaked out several times prior to this error, only it came back after some seconds. This is the server where we had very high CPU usage when the SAN was being heavily used. The SAN vendor checked things out on their end and, of course, found nothing wrong with the SAN equipment. We actually switched from SAN to tmpfs after that, but we didn’t unload the multipath driver. Perhaps now we will.
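A couple of quick checks if you want to see the IRQ sharing and the multipath driver for yourself (output and module names will vary with your hardware):

# which interrupt vectors each device is using, and how heavily they are shared
cat /proc/interrupts
# is the device-mapper multipath driver actually loaded?
lsmod | grep dm_multipath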
Feb 22 Update
We haven’t seen the problem in over three weeks now. See my comments on what actions we took.
Conclusion
Persistence, patience and praeternatural practicality paid off in this perplexing puzzle!
2 replies on “The IT Detective Agency: mail server went down with an old-school problem”
Hi.
Did you solve this IRQ problem?
I have the same issue with openSUSE here in Brazil.
Funny thing is, I just looked because you asked. We are still getting the IRQ error. It is then followed by a
kernel: [6003089.071200] bnx2: eth0 NIC Copper Link is Down
error, but three seconds later the link comes back up. We had applied patches. So I don’t know. We haven’t had a major outage like the one that prompted me to write this post. Maybe we’ve just been lucky. We also stopped using SAN storage, but I’m not even sure that’s related.
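To see how often that pair of messages shows up together, a one-liner like this does the trick (log path assumed):

# pull out the IRQ complaints and the bnx2 link messages to see whether they pair up
egrep 'No irq handler for vector|bnx2: eth0 NIC Copper Link' /var/log/messages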
Feb 22 update – case resolved
Well, we haven’t had the error in over three weeks, so I think we finally cured it. What we did was to apply the latest firmware patches. This is probably waving a dead chicken, but you know how it is.
Then someone looked into the Broadcom network card, its driver, etc. He decided that this problem could be caused by the MSI subsystem (Message Signalled Interrupts) and had seen some discussion of that on the Internet. It’s one thing to change a driver configuration, but how do you make that change persist across reboots? What he did was to edit the file /etc/modprobe.conf.local and put a line into it that turns off MSI for the Broadcom bnx2 driver.
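It was something along these lines – the parameter shown here is the bnx2 driver’s disable_msi module option, which is the usual way to turn MSI off for that card:

# /etc/modprobe.conf.local – turn off Message Signalled Interrupts for the bnx2 NIC driver
options bnx2 disable_msi=1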
We would normally have had a problem by now, so I think these actions fixed it.
Case closed!