Today I had my choice of problems I could highlight, but I like this one the best. Our mail server delivers email to a wide variety of recipients. All was going well and it ran pretty much unattended until this week when it didn’t go so well. Most emails were getting delivered, but more and more were starting to pile up in the queues. This is the story of how we unraveled the mystery.
It’s best to work from examples I think. I noticed emails to me.com were being refused delivery as well as emails to rnbdesign.com. The latter is a smaller company so we heard from them the usual story that we’re the only ones who can’t send to them.
So I forced delivery with verbose logging. I’m running sendmail, so that looks like this:
> sendmail -qRrnbdesign.com -Cconfig_file -v
That didn’t work out, producing a no route to host type of error. I did a DNS lookup by hand. That showed one set of results, while sendmail was connecting to an entirely different IP address. How could that be??
I was at a loss so I do what I do when I’m desperate: strace. That looks like this:
> strace -f sendmail -qRrnbdesign.com -Cconfig_file -v > /tmp/strace 2>&1
That produced 12,000 lines of output. All the system calls that the process and any of its forked processes invoke. Is that too much to comb through by hand? No, not at all, not when you begin to see the patterns.
I pored over the trace, not knowing what most of it meant, but looking for especially any activity regarding networking and DNS. Around line 6,000 I found it. There was mention of nscd.
For the unaware the use of nscd (nameserver caching daemon) might seem innocent enough, or even good-intentioned. What could be wrong with caching frequently used DNS results? The only issue is that it doesn’t work right! nscd derives from UC Berkeley Unix code and has never been supported. I didn’t even like it when I was running SunOS. It caches the DNS queries but ignores TTLs. This is fatal for mail servers or just about anything you can think of, especially on servers that are infrequently booted as mine are.
I stopped nscd right away:
> service nscd stop
and re-ran the sendmail queue runner (same command as above). The rnbdesign.com emails flowed out instantly! Soon hundreds of stuck emails were flushed out.
Of course for good measure nscd had to be removed from the startup sequence:
> chkconfig nscd off
An IT pro always keeps unsolved mysteries in his mind. This time I knew I also had in hand the solution an earlier-documented mystery about email to paladinny.com.
nscd might show up in your SLES or OpenSuse server. I strongly suggest to disable it before you wind up with old DNS values and an extremely hard-to-debug issue.