Categories
Admin Internet Mail SLES

The IT Detecive Agency: emails began piling up this week, no obvious cause

Intro
Today I had my choice of problems I could highlight, but I like this one the best. Our mail server delivers email to a wide variety of recipients. All was going well and it ran pretty much unattended until this week when it didn’t go so well. Most emails were getting delivered, but more and more were starting to pile up in the queues. This is the story of how we unraveled the mystery.

The details
It’s best to work from examples I think. I noticed emails to me.com were being refused delivery as well as emails to rnbdesign.com. The latter is a smaller company so we heard from them the usual story that we’re the only ones who can’t send to them.

So I forced delivery with verbose logging. I’m running sendmail, so that looks like this:

> sendmail -qRrnbdesign.com -Cconfig_file -v

That didn’t work out, producing a no route to host type of error. I did a DNS lookup by hand. That showed one set of results, while sendmail was connecting to an entirely different IP address. How could that be??

I was at a loss so I do what I do when I’m desperate: strace. That looks like this:

> strace -f sendmail -qRrnbdesign.com -Cconfig_file -v > /tmp/strace 2>&1

That produced 12,000 lines of output. All the system calls that the process and any of its forked processes invoke. Is that too much to comb through by hand? No, not at all, not when you begin to see the patterns.

I pored over the trace, not knowing what most of it meant, but looking for especially any activity regarding networking and DNS. Around line 6,000 I found it. There was mention of nscd.

For the unaware the use of nscd (nameserver caching daemon) might seem innocent enough, or even good-intentioned. What could be wrong with caching frequently used DNS results? The only issue is that it doesn’t work right! nscd derives from UC Berkeley Unix code and has never been supported. I didn’t even like it when I was running SunOS. It caches the DNS queries but ignores TTLs. This is fatal for mail servers or just about anything you can think of, especially on servers that are infrequently booted as mine are.

I stopped nscd right away:

> service nscd stop

and re-ran the sendmail queue runner (same command as above). The rnbdesign.com emails flowed out instantly! Soon hundreds of stuck emails were flushed out.

Of course for good measure nscd had to be removed from the startup sequence:

> chkconfig nscd off

An IT pro always keeps unsolved mysteries in his mind. This time I knew I also had in hand the solution an earlier-documented mystery about email to paladinny.com.

Conclusion
nscd might show up in your SLES or OpenSuse server. I strongly suggest to disable it before you wind up with old DNS values and an extremely hard-to-debug issue.

Case closed!

Categories
Internet Mail Linux Perl Raspberry Pi

Raspberry Pi phone home

Intro
In this article I described setting up my Raspberry Pi without ever connecting a monitor keyboard and mouse to it and how I got really good performance using an UHS SD card.

This article represents my first real DIY project on my Pi – one of my own design. My faithful subscribers will recall my post after Hurricane Sandy in which I reacted to an intense desire to know when the power was back on by creating a monitor for that situation. It relied on extremely unlikely pieces of infrastructure. I hinted that it may be possible to use the Raspberry Pi to accomplish the same thing.

I’ve given it a lot of thought and assembled all the pieces. Now I have a home power/Internet service monitor based on my Pi!

This still requires a somewhat unlikely but not impossible combination of infrastructure:
– your own hosted server in the cloud
– ability to send emails out from your cloud server
– access log files on your cloud server are rolled over regularly
– your Pi and your cloud server are in the same time zone
– Raspberry Pi which is acting as a server (meaning you are running it 24×7 and not rebooting it and fooling with it too much)
– a smart phone to receive alert emails or TXT messages

I used my old-school knowledge of Perl to whip something up quickly. One of this years I have to bite the bullet and learn Python decently, but it’s hard when you are so comfortable in another language.

The details
Here’s the concept. From your Pi you make regular “phone home” calls to your cloud server. This could use any protocol your server is listening on, but since most cloud servers run web servers, including mine, I phone home using HTTP. Then on your cloud server you look for the phone home messages. If you don’t see one after a certain time, you send an alert to an email account. Then, once service – be it power or Internet connectivity – is restored to your house, your Pi resumes phoning home and your cloud monitor detects this and sends a Good message.

I have tried to write minimalist code that yet will work and even handle common error conditions, so I think it is fairly robust.

Set up your Pi
On your Pi you are “phoning home” to your server. So you need a line something like this in your crontab file:

# This gets a file and leaves a timestamp behind in the access log
* * * * * /usr/bin/curl --connect-timeout 30 http://yourcloudserver.com/raspberrypiPhoneHome?`perl -e 'print time()'` > /dev/null 2>&1

Don’t know what I’m talking about when I say edit your crontab file?

> export EDITOR=vi
> crontab -e

That first line is only required for fans of the vi editor.

That part was easy, right? That will have your server “phone home” every minute, 24×7. But we need an aside to talk about time on the Pi.

Getting the right time on the Raspberry Pi
This monitoring solution assumes Ras Pi and home server are in the same time zone (because we kept it simple). I’ve seen at least a couple of my Raspberry Pi’s where the time zone was messed up so I need to document the fix.

Run the date command
$ date

Sat Apr 29 17:10:13 EDT 2017

Now it shows it is set for EDT so the timezone is correct. At first it showed something like UTC.

Make sure you are running ntp:
$ ntpq ‐p

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+time.tritn.com  63.145.169.3     2 u  689 1024  377   78.380    2.301   0.853
-pacific.latt.ne 68.110.9.223     2 u  312 1024  377  116.254   11.565   5.864
+choppa.chieftek 172.16.182.1     3 u  909 1024  377   65.430    4.185   0.686
*nero.grnet.gr   .GPS.            1 u  106 1024  377  162.089  -10.357   0.459

You should get results similar to those above. In particular the jitter numbers should be small, or at least less than 10 (units are msec for the curious).

If you’re missing the ntpq command then do a

$ sudo apt-get install ntp

Set the correct timezone with a

$ sudo dpkg-reconfigure tzdata

and choose Americas, then new York, or whatever is appropriate for your geography. The Internet has a lot of silly advice on this point so I hope this clarifies the point.

Note that you need to do both things. In my experience time on Raspberry Pis tends to drift so you’ll be off by seconds, which is a bad thing. ntp addresses that. And having it in the wrong timezone is just annoying in general as all your logs and file times etc will be off compared to how you expect to see them.

On your server
Here is the Perl script I cooked up. Some modifications are needed for others to use, such as email addresses, access log location and perhaps the name and switches for the mail client.

So without further ado. here is the monitor script:

#!/usr/bin/perl
# send out alerts related to Raspberry Pi phone home
# this is designed to be called periodically from cron
# DrJ - 2/2013
#
# to test good to error transition,
# call with a very small maxDiff, such as 0!
use Getopt::Std;
getopts('m:d'); # maximum allowed time difference
$maxDiff = $opt_m;
$DEBUG = 1 if $opt_d;
unless (defined($maxDiff)) {
  usage();
  exit(1);
}
# use values appropriate for your situation here...
$mailsender = '[email protected]';
$recipient = '[email protected]';
$monitorName = 'Raspberry Pi phone home';
# access line looks like:
# 96.15.212.173 - - [02/Feb/2013:22:00:02 -0500] "GET /raspberrypiPhoneHome?136456789 HTTP/1.1" 200 455 "-" "curl/7.26.0"
$magicString = "raspberrypiPhoneHome";
# modify as needed for your situation...
$accessLog = "/var/log/drjohns/access.log";
#
# pick up timestamp in access file
$piTime = `grep $magicString $accessLog|tail -1|cut -d\? -f2|cut -d' ' -f1`;
$curTime = time();
chomp($time);
$date = `date`;
chomp($date);
# your PID file is somewhere else. It tells us when Apache was started.
# you could comment out these next lines just to get started with the program
$PID = "/var/run/apache2.pid";
($atime,$mtime,$ctime) = (stat($PID))[8,9,10];
$diff = $curTime - $piTime;
print "magicString, accessLog, piTime, curTime, diff: $magicString, $accessLog, $piTime, $curTime, $diff\n" if $DEBUG;
print "accessLog stat. atime, mtime, ctime: $atime,$mtime,$ctime\n" if $DEBUG;
if ($curTime - $ctime < $maxDiff) {
  print "Apache hasn't been running long enough yet to look for something in the log file. Maybe next time\n";
  exit(0);
}
#
$goodFile = "/tmp/piGood";
$errorFile = "/tmp/piError";
#
# Think of it as state machine. There are just a few states and a few transitions to consider
#
if (-e $goodFile) {
  print "state: good\n" if $DEBUG;
  if ($diff < $maxDiff) {
    print "Remain in good state\n" if $DEBUG;
  } else {
# transition to error state
    print "Transition from good to error state at $date, diff is $diff\n";
    sendMail("Good","Error","Last call was $diff seconds ago");
# set state to Error
    system("rm $goodFile; touch $errorFile");
  }
} elsif (-e $errorFile) {
  print "state: error\n" if $DEBUG;
  if ($diff > $maxDiff) {
    print "Remain in error state\n" if $DEBUG;
  } else {
# transition to good state
    print "Transition from error to good state at $date, diff is $diff\n";
    sendMail("Error","Good","Service restored. Last call was $diff seconds ago");
# set state to Good
    system("rm $errorFile; touch $goodFile");
  }
} else {
  print "no state\n" if $DEBUG;
  if ($diff < $maxDiff) {
    system("touch $goodFile");
    sendMail("no state","Good","NA") if $DEBUG;
    print "Transition from no state to Good at $date\n";
# don't send alert
  } else {
    print "Remain in no state\n" if $DEBUG;
  }
}
####################
sub sendMail {
($oldState,$state,$additional) = @_;
print "oldState,state,additional: $oldState,$state,$additional\n" if $DEBUG;
$subject = "$state : $monitorName";
open(MAILX,"|mailx -r \"$mailsender\" -s \"$subject\" $recipient") || die "Cannot run mailx $mailsender $subject!!\n";
print MAILX qq(
$monitorName is now in state: $state
Time: $date
Former state was $oldState
Additional info: $additional
 
- sent from pialert program
);
close(MAILX);
 
}
###############################
sub usage {
  print "usage: $0 -m <maxDiff (seconds)> [-d (debug)]\n";
}

This is called from my server’s crontab. I set it like this:

 Call monitor that sends an alert if my Raspberry Pi fails to phone home - DrJ 2/13
0,5,10,15,20,25,30,35,40,45,50,55 * * * * /home/drj/pialert.pl -m 300 >> /tmp/pialert.log

My /tmp/pialert.log file looks like this so far:

Transition from no state to Good at Wed Feb  6 12:10:02 EST 2013
Apache hasn't been running long enough yet to look for something in the log file. Maybe next time
Apache hasn't been running long enough yet to look for something in the log file. Maybe next time
Transition from good to error state at Fri Feb  8 10:55:01 EST 2013, diff is 420
Transition from error to good state at Fri Feb  8 11:05:02 EST 2013, diff is 1

The last two lines result from a test I ran – i commented out the crontab entry on my Pi to be absolutely sure it was working.

The error message I got in my email looked like this:

Subject: Error : Raspberry Pi phone home
 
Raspberry Pi phone home is now in state: Error 
Time: Fri Feb  8 10:55:01 EST 2013
Former state was Good
Additional info: Last call was 420 seconds ago
 
- sent from pialert program

Why not use Nagios?
Some will realize that I replicated functions that are provided in Nagios, why not just hang my stuff off that well-established monitoring software? I considered it, but I wanted to stay light. I think my approach, while more demanding of me as a programmer, keeps my server unburdened by running yet another piece of software that has to be understood, debugged, maintained and patched. If you already have it, fine, by all means use it for the alerting part. I’m sure it gives you more options. For an approach to installing nagios that makes it somewhat manageable see the references.

A few words about sending mail
I send mail directly from my cloud server, I have no idea what others do. With Amazon, my elastic IP was initially included in blacklists (RBLs), etc, so I really couldn’t send mail without it being rejected. they have procedures you can follow to remove your IP from those lists, and it really worked. Crucially, it allowed me to send as a TXT message. Just another reason why you can’t really beat Amazon hosting (there was no charge for this feature).

And sending TXT messages
I think most wireless providers have an email gateway that allows you to send a TXT message (SMS) to one of their users via email (SMTP) if you know their cell number. For instance with Verizon the formula is

Verizon

AT&T
cell-number@txt.att.net.

T-Mobile
cell-number@tmomail.net

Conclusion
We have assembled a working power/Internet service monitor as a DIY project for a Raspberry Pi. If you want to use your Pi for a lot of other things I suggest to leave this one for your power monitor and buy another – they’re cheap (and fun)!

I will now know whenever I lose power – could be any minute now thanks to Nemo – and when it is restored, even if I am not home (thanks to my SmartPhone). See in my case my ISP, CenturyLink, is pretty good and rarely drops my service. JCP&L, not so much.

Admittedly, most people, unlike me, do not have their own cloud-hosted server, but maybe it’s time to get one?

References
Open Monitoring Distribution (OMD) makes installing and configuring nagios a lot easier, or so I am told. It is described here.
I’ve gotten my mileage out of the monitor perl script in this post: I’ve recently re-used it with modifications for a similar situation except that the script is being called by HP SiteScope, and, again, a Raspberry Pi is phoning home. Described here.

Categories
Admin DNS Internet Mail SLES

Strange problem with email to paladinny.com

Intro
This is probably the most obscure of all postings I will ever do – it’s really just opening up my private journal to the Internet, which helps me when I need to recall how I fixed something.

So the story is that I’m having trouble sending email to anyone in the domain paladinny.com, and I just couldn’t figure out why.

The details
With my sendmail config I finally rolled up my sleeves, and did some debugging, even though I am pressed for time. Start up our sendmail debugging session:

> sendmail -Cconfig_file.cf -bt -d35.9

This produces a lot of blah, blah, configuration settings, blah, blah, and finally a sort of sendmail debugging shell. So let’s test a good “normal” domain:

> 3,0 [email protected]

canonify           input: test @ gmail . com
Canonify2          input: test < @ gmail . com >
Canonify2        returns: test < @ gmail . com . >
canonify         returns: test < @ gmail . com . >
parse              input: test < @ gmail . com . >
Parse0             input: test < @ gmail . com . >
Parse0           returns: test < @ gmail . com . >
ParseLocal         input: test < @ gmail . com . >
ParseLocal       returns: test < @ gmail . com . >
Parse1             input: test < @ gmail . com . >
Mailertable        input: < gmail . com > test < @ gmail . com . >
Mailertable        input: gmail . < com > test < @ gmail . com . >
Mailertable      returns: test < @ gmail . com . >
Mailertable      returns: test < @ gmail . com . >
SmartTable         input: test < @ gmail . com . >
SmartTable       returns: test < @ gmail . com . >
MailerToTriple     input: < > test < @ gmail . com . >
MailerToTriple   returns: test < @ gmail . com . >
Parse1           returns: $# esmtp $@ gmail . com . $: test < @ gmail . com . >
parse            returns: $# esmtp $@ gmail . com . $: test < @ gmail . com . >

and then this problem domain:

> 3,0 [email protected]

canonify           input: test @ paladinny . com
Canonify2          input: test < @ paladinny . com >
Canonify2        returns: test < @ paladinny . no-ip . biz . >
canonify         returns: test < @ paladinny . no-ip . biz . >
parse              input: test < @ paladinny . no-ip . biz . >
Parse0             input: test < @ paladinny . no-ip . biz . >
Parse0           returns: test < @ paladinny . no-ip . biz . >
ParseLocal         input: test < @ paladinny . no-ip . biz . >
ParseLocal       returns: test < @ paladinny . no-ip . biz . >
Parse1             input: test < @ paladinny . no-ip . biz . >
Mailertable        input: < paladinny . no-ip . biz > test < @ paladinny . no-ip . biz . >
Mailertable        input: paladinny . < no-ip . biz > test < @ paladinny . no-ip . biz . >
Mailertable        input: paladinny . no-ip . < biz > test < @ paladinny . no-ip . biz . >
Mailertable      returns: test < @ paladinny . no-ip . biz . >
Mailertable      returns: test < @ paladinny . no-ip . biz . >
Mailertable      returns: test < @ paladinny . no-ip . biz . >
SmartTable         input: test < @ paladinny . no-ip . biz . >
SmartTable       returns: test < @ paladinny . no-ip . biz . >
MailerToTriple     input: < > test < @ paladinny . no-ip . biz . >
MailerToTriple   returns: test < @ paladinny . no-ip . biz . >
Parse1           returns: $# esmtp $@ paladinny . no-ip . biz . $: test < @ paladinny . no-ip . biz . >
parse            returns: $# esmtp $@ paladinny . no-ip . biz . $: test < @ paladinny . no-ip . biz . >

I have to look more into what Canonify2 does. But this gives me an idea: force the mailertable to handle paladinny . no-ip . biz the way I want it to, namely:

paladinny.no-ip.biz relay:barracuda.cblconsulting.com

because in DNS my DNS server returns this funny result:

> dig mx paladinny.com

; <<>> DiG 9.6-ESV-R7-P3 <<>> mx paladinny.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17559
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 0
 
;; QUESTION SECTION:
;paladinny.com.                 IN      MX
 
;; ANSWER SECTION:
paladinny.com.          351     IN      CNAME   paladinny.no-ip.biz.
 
;; AUTHORITY SECTION:
no-ip.biz.              60      IN      SOA     nf1.no-ip.com. hostmaster.no-ip.com. 2052775595 600 300 604800 600
 
;; Query time: 30 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jan 18 08:53:49 2013
;; MSG SIZE  rcvd: 121

whereas Google’s public DNS says this, which looks like the intended result:

> dig mx paladinny.com @8.8.8.8

; <<>> DiG 9.6-ESV-R7-P3 <<>> mx paladinny.com @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3749
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
 
;; QUESTION SECTION:
;paladinny.com.                 IN      MX
 
;; ANSWER SECTION:
paladinny.com.          1800    IN      MX      10 barracuda.cblconsulting.com.
 
;; Query time: 236 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Fri Jan 18 08:55:42 2013
;; MSG SIZE  rcvd: 71

So at least we know where that odd paladinny.no-ip.biz comes from, sort of. It comes from my nameserver, but where it got that answer from I have no idea. It doesn’t come from the authoritative nameservers:

> dig mx paladinny.com @dns1.name-services.com.

; <<>> DiG 9.6-ESV-R7-P3 <<>> mx paladinny.com @dns1.name-services.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45704
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
 
;; QUESTION SECTION:
;paladinny.com.                 IN      MX
 
;; ANSWER SECTION:
paladinny.com.          1800    IN      MX      10 barracuda.cblconsulting.com.
 
;; Query time: 82 msec
;; SERVER: 98.124.192.1#53(98.124.192.1)
;; WHEN: Fri Jan 18 08:59:50 2013
;; MSG SIZE  rcvd: 71

A CNAME is not an MX record, so why my nameserver is returning an answer (ANSWER: 1)when queried for the MX record when all it thinks it has is a CNAME seems to be an out-and-out error.

And putting the resolved name in the mailertable is also not normal. Normally you put the domain itself, as in:

paladinny.com relay:barracuda.cblconsulting.com

and of course that’s the first thing I tried, but it has no effect whatsoever.

February Update and Conclusion
The mystery was solved when a whole bunch of email deliveries started failing on my system and I was forced to do some serious debugging. Long story short my SLES system was regrettably running nscd, the nameserver caching daemon. I didn’t even bother to check paladinny.com. So many other things cleared up when I killed it I’m sure it was the cause of the paladinny.com issue as well. This is all described in this post.

Categories
Admin Internet Mail Linux Network Technologies

The IT Detective Agency: mail server went down with an old-school problem

Intro
I got a TXT from my monitoring system last night. I ignored it because I knew that someone was working on the firewall at that time. I’ve learned enough about human nature to know that it is easy to ignore the first alert. So I’ve actually programmed HP SiteScope alerts to send additional ones out after four hours of continuous errors. When I got the second one at 9 PM, I sprang into action knowing this was no false alarm!

The details
Thanks to a bank of still-green monitors I could pretty quickly rule out what wasn’t the matter. Other equipment on that subnet was fine, so the firewall/switch/router was not the issue. Then what the heck was it? And how badly was it impacting mail delivery?

This particular server has two network interfaces. Though the one interface was clearly unresponsive to SMTP, PING or any other protocol, I hadn’t yet investigated the other interface, which was more Internet-facing. I managed to find another Linux server on the outer network and tried to ping the outer interface. Yup. That worked. I tried a login. It took a whole long time to get through the ssh login, but then I got on and the server looked quite normal. I did a quick ifconfig – the inner interface listed up, had the right IP, looked completely normal. I tried some PINGs from it to its gateway and other devices on the inner network. Nothing doing. No PINGs were returned.

I happened to have access to the switch. I thought maybe someone had pulled out its cable. So I even checked the switch port. It showed connected and 1000 mbits, exactly like the other interface. So it was just too improbable that someone pulled out the cable and happened to plug another cable from another server into that same switch port. Not impossible, just highly improbable.

Then I did what all sysadmins do when encountering a funny error – I checked the messages file in /var/log/messages. At first I didn’t notice anything amiss, but upon closer inspection there was one line that was out-of-place from the usual:

Nov  8 16:49:42 drjmailgw kernel: [3018172.820223] do_IRQ: 1.221 No irq handler for vector (irq -1)

Buried amidst the usual biddings of cron was a kernel message with an IRQ complaint. What the? I haven’t worried about IRQ since loading Slackware from diskettes onto my PC in 1994! Could it be? I have multiple ways to test when the interface died – SiteScope monitoring, even the mail log itself (surely its log would look very different pre- and post-problem.) Yup.

That mysterious irq error coincides with when communication through that interface stopped working. Oh, for the record it’s SLES 11 SP1 running on HP server-class hardware.

What about my mail delivery? In a panic, realizing that sendmail would be happy as a clam through such an error, I shut down its service. I was afraid email could be piling up on this server, for hours, and I pride myself in delivering a faultless mail service that delivers in seconds, so that would be a big blow. With sendmail shut down I knew the backup server would handle all the mail seamlessly.

This morning, in the comfort of my office I pursued the answer to that question What was happening to my mail stream during this time? I knew outbound was not an issue (actually the act of writing this down makes me want to confirm that! I don’t like to have falsehoods in writing. Correct, I’ve now checked it and outbound was working.) But it was inbound that really worried me. Sendmail was listening on that interface after all, so I didn’t think of anything obvious that would have stopped inbound from being readily accepted then subsequently sat on.

But such was not the case! True, the sendmail listener was available and listening on that external interface, but, I dropped a hint above. Remember that my ssh login took a long while? That is classic behaviour when a server can’t communicate with its nameservers. It tries to do a reverse lookup on the ssh client’s source IP address. It tried the first nameserver, but it couldn’t communicate with it because it was on the internal network! Then it tried its next nameserver – also a no go for the same reason. I’ve seen the problem so often I wasn’t even worried when the login took a long time – a minute or so. I knew to wait it out and that I was getting in.

But in sendmail I had figured that certain communications should never take a long time. So a long time ago I had lowered some of the default timeouts. My mc file in the upstream server contains these lines:

...
dnl Do not use RFC1413 identd. p 762  It requires another whole in the F/W
define(`confTO_IDENT',`0s')dnl
dnl Set more reasonable timeouts for SMTP commands'
define(`confTO_INITIAL',`1m')dnl
define(`confTO_COMMAND',`5m')dnl
define(`confTO_HELO',`1m')dnl
define(`confTO_MAIL',`1m')dnl
define(`confTO_RCPT',`1m')dnl
define(`confTO_RSET',`1m')dnl
define(`confTO_MISC',`1m')dnl
define(`confTO_QUIT',`1m')dnl
...

I now think that in particular the HELO timeout (TO_HELO) of 60 seconds saved me! The upstream server reported in its mail log:

Timeout waiting for input from drjmailgw during client greeting

So it waited a minute, as drjmailgw tried to do a reverse lookup on its IP, unsuccessfully, before proceeding with the response to HELO, then went on to the secondary server as per the MX record in the mailertable. Whew!

More on that IRQ error
Let’s go back to that IRQ error. I got schooled by someone who knows these things better than I. He says the Intel chipset was limited insofar as there weren’t enough IRQs for all the devices people wanted to use. So engineers devised a way to share IRQs amongst multiple devices. Sort of like virtual IPs on one physical network interface. Anyways, on this server he suspects that something is wrong with the multipath driver which is loaded for the fiber channel host adapter card. In fact he noticed that the network interface flaked out several times previous to this error. Only it came back after some seconds. This is the server where we had a very high CPU when the SAN was being heavily used. The SAN vendor checked things out on their end and, of course, found nothing wrong with the SAN equipment. We actually switched from SAN to tmpfs after that. But we didn’t unload the multipath driver. Perhaps now we will.

Feb 22 Update
We haven’t seen the problem in over three weeks now. See my comments on what actions we took.

Conclusion
Persistence, patience and praeternatural practicality paid off in this perplexing puzzle!

Categories
Admin DNS Internet Mail

The IT Detective Agency: can’t get email from one sender

Intro
For this article to make any sense whatsoever you have to understand that I enforce SPF in my mail system, which I described in SPF – not all it’s cracked up to be.

The details
Well, some domain admins boldly eliminated their SOFTFAIL conditions – but didn’t quite manage to pull it off correctly! Today I ran into this example. A sender from the domain pclnet.net sent me email from IP 64.8.71.112 which I didn’t get – my SPF protection rejected it. The sender got an error:

550 IP Authorization check failed - psmtp

Let’s look at his SPF record with this DNS query:

$ dig txt pclnet.net

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.10.rc1.el6_3.2 <<>> txt pclnet.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42145
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
 
;; QUESTION SECTION:
;pclnet.net.                    IN      TXT
 
;; ANSWER SECTION:
pclnet.net.             300     IN      TXT     "v=spf1 mx ip4:24.214.64.230 ip4:24.214.64.231 ip4:64.8.71.110 ipv4:64.8.71.111 ipv4:64.8.71.112 -all"

That IP, 64.8.71.112 is right there at the end. So what’s the deal?

Well, Google/Postini was called in for help. They apparently still have people who are on the ball because they noticed something funny about this SPF record, namely, that it isn’t correct. Notice that the first few IPs are prefixed with an ip4? Well the last IPs are prefixed with an ipv4! They are not both valid. In fact the ipv4 is not valid syntax and so those IPs are not considered by programs which evaluate SPF records, hence the rejection!

My recourse in this case was to remove SPF enforcement on an exception basis for this one domain.

Case closed!

Conclusion
It’s now a few months after my original post about SPF. I’m sticking with it and hope to increase its adoption more broadly. It has worked well, and the exceptions, such as today’s, have been few and far between. It’s a good tool in the fight against spam.

Categories
Admin Internet Mail

SPF – not all it’s cracked up to be

Intro
I recently implemented a strict enforcement of Sender Policy Framework records on a mail server. The amount of false positives wasn’t worth the added benefit.

The details
When I want to learn about SPF I turn to the Wikipedia article. Many secure mail gateways offer the possibility to filter out emails based on the sending MTA being in the list allowed by that domains SPF record in DNS. I began to turn on enforcement domain-by-domain. First for bbb.org since their domain was spoofed by an annoying spam campaign earlier this year. Then fedex.com (another perennial favorite), aexp.com, amazon.com, ups.com, etc.

I was stricter than the standard. FAIL -> reject; SOFTFAIL -> reject! And it seemed to work well. Last week I went whole hog and turned on SPF enforcement for all domains with a defined SPF record, but with slightly relaxed standards. FAIL -> REJECT; SOFTFAIL -> quarantine. While it’s true that it may have helped prevent some new spam campaigns, complaints started mounting. Stuff was going into quarantine that never used to!

I knew SOFTFAIL must be the issue. Most(?) domains seem to have a SOFTFAIL condition. Here is a typical example for the domain communica-usa.com:

$ dig txt communica-usa.com

; <<>> DiG 9.7.3-P3-RedHat-9.7.3-8.P3.el6_2.2 <<>> txt communica-usa.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 15711
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
 
;; QUESTION SECTION:
;communica-usa.com.             IN      TXT
 
;; ANSWER SECTION:
communica-usa.com.      600     IN      TXT     "v=spf1 ip4:72.241.62.1/26 ~all"

So the ~all at the end means they have defined a SOFTFAIL that matches “everything else.” And if a domain does have a SOFTFAIL contingency in its SPF record, this is almost always how it is defined. Which, when you think about it, is the same, formally, as not having bothered to define an SPF record at all! Because the proper action to take is NONE.

I reasoned as follows: why bother defining an SPF record unless you know what you’re doing and you feel you actually can identify the MTAs your mail will come from? Except when you’re first starting out and learning the ropes. That’s why I took stronger action than the standard suggested for SOFTFAIL.

A learning experience
Well, it wasn’t immediate, but after a week to ten days the complaints started rolling in. Looking at what had actually been released from quarantine was an eye-opener. It confirmed something I had previously seen, namely that only a minority speak up so the problems brought to my attention amounted to the tip of the iceberg. So the problem of my own creation was quite widespread in actual fact. I investigated a few by hand. Yup, they were sent from IPs not permitted by the narrowly defined part of their SPF record, and they were repeat offenders. Domains such as the aforementioned communica-usa.com, right.com, redphin.com, etc, etc. I haven’t gotten an explanation for a single infringement – finding someone with sufficient knowledge in these small companies to have a discussion with is a real challenge.

Waving the white flag
I had to back down on my quarantining SOFTFAILs in the global SPF setting. Now it’s SOFTFAIL -> PASS. I could have made domain-by-domain exceptions, but when the number of problem domains climbed over a hundred I decided against that reactive approach. I kept the original set of defined domains, ups.com, etc with their more-aggressive settings. These are larger companies, anyways, which either didn’t even have a SOFTFAIL defined, or never needed their SOFTFAIL.

August 2013 update
Well, a year has passed since this experiment – long enough to establish a trend. I was hoping by writing this article to nudge the community in the direction of wider adoption of SPF usage and more importantly, enforcement. Alas, I can now attest that there is no perceptible trend in either direction. How do I know? Because even with our more lax enforcement, we still ran into problems with lots of senders. I.e., they were too, how do I say this politely, confused, and apparently unable to make their strict SPF records actually match their sending hosts! Incredible, but true. If there were widespread or even spotty adoption, even 5 – 10 % of MTAs adopting enforcement of strict SPF records, they would also have refused messages from senders like this and these senders would have been motivated to fix their self-inflicted problems. But, in reality, what I have seen is complete lack of understanding amongst any of the dozen or so companies with this issue and complete inability to resolve it, forcing me to make an exception for their error.

SPF records can be bad for a number of reasons. Some companies provide two of them, and that is a source of problems. Here are a couple examples I had this issue with:

scterm.com

scterm.com.             3600    IN      TXT     "v=spf1 ip4:65.122.37.36 include:spf.protection.outlook.com -all"
scterm.com.             3600    IN      TXT     "MS=ms56827535"

atlantichealth.org

atlantichealth.org.     3600    IN      TXT     "v=spf1 mx ip4:198.140.183.244/32 ip4:198.140.183.245/32 ip4:198.140.184.244/32 ip4:198.140.184.245/32 ip4:207.211.31.0/25 ip4:205.139.110.0/24 ip4:205.139.111.0/24 -all"

So my guesstimate is < 1% adoption of even the most conservative (meaning least likely to reject legitimate emails) SPF enforcement. Too bad about that. Spammers really appreciate it, though. Conclusion
By being forced to back off aggressive SPF settings most domains with SOFTFAIL defined as “~all” – which is most of them – are lost to any enforcement. It’s a shame. SPF sounded so promising to me at first. Spammers are really good at controlling botnets, but terrible at controlling corporate MTAs. But still I do advocate for wider adoption of SPF, without the SOFTFAIL condition, of course!

References
A nice SPF validator is found here. It has no obnoxious ads or spyware.

Categories
Admin Internet Mail Linux

Yahoo! stopped accepting emails – what we can learn about sendmail from this

Flash Update
I noticed yahoo.com, yahoo.ca and yahoo.com.br had stopped accepting emails today.

I wanted to provide insight into this problem that few others would have, namely, when did the problem first occur?

I looked at my stuck messages in queue. I realized that with sendmail, there is no easy way to answer the question: what is the oldest stuck message for domain XYZ in the queue? So I rigged up something, very crude, but typical for a Unix command line-type of guy.

Namely:

$ grep yahoo */qf*|grep for|cut -d\; -f2|sort|head

The answer? My first stuck message comes from 12:43:46 EST today (July 31st). As of this writing, 3:53 PM, the problem still exists, which already makes it a quite long outage even if it is fixed in the next few minutes. Just minutes later, around 4 PM, I noticed a lot of the messages being delivered, so the problem seems to have finally cleared up.

The expression above works if you are root and the current directory is, e.g., /mqueue, under which you have queue directories q0, q1, …, q9, etc corresponding to an MC statement:

define(`QUEUE_DIR',`/mqueue/q*')dnl

The grep above might miss a few messages, but if you have lots of them as I do, it doesn’t really matter as it’s purpose is just to convey the general idea of when the problem started. In my case it can be safely assumed that I am continuously sending emails to yahoo.com, so it is not possible to have a window that isn’t covered of more than ten minutes or so during the day.

Conclusion
We helped document a complete three-hour outage of Yahoo mail. Along the way we learned of a deficiency in sendmail’s mailq command – how limited its reporting options are. We compensated for that by rolling our own series of commands to answer the question of what is the oldest mail of this type in our queues.

Categories
Admin Internet Mail Linux Perl

The IT Detective Agency: last letter of attachment name is missing!

Intro
Today we bring you an IT whodunit thriller. A user using Lotus Notes informs his local IT that a process that emails SQL reports to him and a few others has suddenly stopped working correctly. The reports either contain an HTML attachment where the attachment type has been chopped to “ht” instead of “htm,” or an MHTML attachment type which has also been chopped, down to “mh” instead of “mht.” They get emailed from the reporting server to a sendmail mail relay. Now the convenient ability to double-click on the attachment and launch it stopped working as a result of these chopped filenames. What’s going on? Fix it!

Let’s Reproduce the Problem
Fortunately this one was easier than most to reproduce. But first a digression. Let’s have some fun and challenge ourselves with it before we deep dive. What do you think the culprit is? What’s your hypothesis? Drawing on my many years of experience running enterprise-class sendmail servers, and never before having seen this problem despite the hundreds of millions of delivered emails, my best instincts told me to look elsewhere.

The origin server, let’s call it aspen, sends few messages, so I had the luxury to turn on tracing on my sendmail server with a filter limiting the traffic to its IP:

$ tcpdump -i eth0 -s 1540 -w /tmp/aspen.cap host aspen

Using wireshark to analyze asp.cap and following the tcp stream I see this:

...
Content-Type: multipart/mixed;
		 boundary="CSmtpMsgPart123X456_000_C800C42D"
 
This is a multipart message in MIME format
 
--CSmtpMsgPart123X456_000_C800C42D
Content-Type: text/plain;
		 charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
 
SQLplus automated report
--CSmtpMsgPart123X456_000_C800C42D
Content-Type: application/octet-stream;
		 name="tower status_2012_06_04--09.25.00.htm"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
		 filename="tower status_2012_06_04--09.25.00.htm
 
<html><head></head><body><h1>Content goes here...</h1></body>
</html>
 
--CSmtpMsgPart123X456_000_C800C42D--

Result of trace of original email as received by sendmail

But the source as viewed from within Lotus Notes is:

...
Content-Type: multipart/mixed;
		 boundary="CSmtpMsgPart123X456_000_C800C42D"
 
 
--CSmtpMsgPart123X456_000_C800C42D
Content-Transfer-Encoding: 7bit
Content-Type: text/plain;
		 charset="iso-8859-1"
 
SQLplus automated report
--CSmtpMsgPart123X456_000_C800C42D
Content-Type: application/octet-stream;
		 name="tower status_2012_06_04--09.25.00.htm"
Content-Disposition: attachment;
		 filename="tower status_2012_06_04--09.25.00.ht"
Content-Transfer-Encoding: base64
 
PGh0bWw+PGhlYWQ+PC9oZWFkPjxib2R5PjxoMT5Db250ZW50IGdvZXMgaGVyZS4uLjwvaDE+PC9i
b2R5Pg0KPC9odG1sPg==
 
--CSmtpMsgPart123X456_000_C800C42D--

Same email after being trasferred to Lotus Notes

I was in shock.

I fully expected the message source to go through unaltered all the way into Lotus Notes, but it didn’t. The trace taken before sendmail’s actions was not an exact match to the source of the message I received. So either sendmail or Lotus Notes (or both) were altering the source in significant ways.

At the same time, we got a big clue as to what is behind the missing letter in the file extension. To highlight it, compare this line from the trace:

filename=”tower status_2012_06_04–09.25.00.htm

to that same line as it appears in the Lotus Notes source:

filename=”tower status_2012_06_04–09.25.00.ht

So there is no final close quote (“) in the filename attribute as it comes from the aspen server! That can’t be good.

But it used to work. What do we make of that fact??

I had to dig farther. I was suddenly reminded of the final episode of House where it is apparent that the solving the puzzle of symptoms is the highest aspiration for Doctor House. Maybe I am similarly motivated? Because I was definitely willing to throw the full weight of my resources behind this mystery. At least for the half-day I had to spare on this.

First step was to reproduce the problem myself. For sending an email you would normally use sendmail or mailx or such, but I didn’t trust any of those programs – afraid they would mess with my headers in secret, undocumented ways.

So I wrote my own mail sending program using Perl/Expect. Now I’m not advocating this as a best practice. It’s just that for me, given my skillset and perceived difficulty in finding a proper program to do what I wanted (which I’m sure is out there), this was the path of least resistance, the best and most efficient use of my time. You see, I already had the core of the program written for another purpose, so I knew it wouldn’t be too difficult to finish for this purpose. And I admit I’m not the best at Expect and I’m not the best at Perl. I just know enough to get things done and pretty quickly at that.

OK. Enough apologies. Here’s that code:

#!/usr/bin/perl
# drjohnstechtalk.com - 6/2012
# Send mail by explicit use of the protocol
$DEBUG = 1;
use Expect;
use Getopt::Std;
getopts('m:r:s:');
$recip = $opt_r;
$sender = $opt_s;
$hostname = $ENV{HOSTNAME};
chop($hostname);
print "hostname,mailhost,sender,recip: $hostname,$opt_m,$sender,$recip\n" if $DEBUG;
$telnet = "telnet";
@hosts = ($opt_m);
$logf = "/var/tmp/smtpresults.log";
 
$timeout = 15;
 
$data = qq(Subject: test of strange MIME error
X-myHeader: my-value
From: $sender
To: $recip
Subject: SQLplus Report - tower status
Date: Mon, 4 Jun 2012 9:25:10 --0400
Importance: Normal
X-Mailer: ATL CSmtp Class
X-MSMail-Priority: Normal
X-Priority: 3 (Normal)
MIME-Version: 1.0
Content-Type: multipart/mixed;
        boundary="CSmtpMsgPart123X456_000_C800C42D"
 
This is a multipart message in MIME format
 
--CSmtpMsgPart123X456_000_C800C42D
Content-Type: text/plain;
        charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
 
SQLplus automated report
--CSmtpMsgPart123X456_000_C800C42D
Content-Type: application/octet-stream;
        name="tower status_2012_06_04--09.25.00.htm"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
        filename="tower status_2012_06_04--09.25.00.htm
 
<html><head></head><body><h1>Content goes here...</h1></body>
</html>
--CSmtpMsgPart123X456_000_C800C42D--
 
.
);
sub myInit {
# This structure is ugly (p.148 in the book) but it's clear how to fill it
@steps = (
        { Expect => "220 ",
          Command => "helo $hostname"},
# Envelope sender
        { Expect => "250 ",
          Command => "mail from: $sender"},
# Envelope recipient
        { Expect => "250 ",
          Command => "rcpt to: $recip"},
# data command
        { Expect => "250 ",
          Command => "data"},
# start mail message
        { Expect => "354 Enter ",
          Command => $data},
# end session nicely
        { Expect => "250 Message accepted ",
          Command => "quit"},
);
}       # end sub myInit
#
# Main program
open(LOGF,">$logf") || die "Cannot open log file!!\n";
foreach $host (@hosts) {
  login($host);
}
 
# create an Expect object by spawning another process
sub login {
($host) = @_;
myInit();
#@params = ($host," 25");
$init_command = "$telnet $host 25";
#$Expect::Debug = 3;
my $exp = Expect->spawn("$init_command")
         or die "Cannot spawn $command: $!\n";
#
# Now run all the other commands
foreach $step (@steps) {
  $i++;
  $expstr = %{$step}->{Expect};
  $cmd = %{$step}->{Command};
#  print "expstr,cmd: $expstr, $cmd\n";
# Logging
#$exp->debug(2);
#$exp->exp_internal(1);
  $exp->log_stdout(0);  # disable stdout for each command
  $exp->log_file($logf);
  @match_patterns = ($expstr);
  ($matched_pattern_position, $error, $successfully_matching_string, $before_match, $after_match) = $exp->expect($timeout,
@match_patterns);
  unless ($matched_pattern_position == 1) {
    $err = 1;
    last;
  }
  #die "No match: error was: $error\n" unless $matched_pattern_position == 1;
  # We got our match. Proceed.
  $exp->send("$cmd\n");
}       # end loop over all the steps
 
#
# hard close
$exp->hard_close();
close(LOGF);
#unlink($logf);
}       # end sub login

Code for sendmsg2.pl

Invoke it:

$ ./sendmsg2.pl -m sendmail_host -s [email protected] -r [email protected]

The nice thing with this program is that I can inject a message into sendmail, but also I can inject it directly into the Lotus Notes smtp gateway, bypassing sendmail, and thereby triangulate the problem. The sendmail and Lotus Notes servers have slightly different responses to the various protocol stages, hence I clipped the Expect strings down to the minimal common set of characters after some experimentation.

This program makes it easy to test several scenarios of interest. Leave the final quote and inject into either sendmail or Lotus Notes (LN). Tack on the final quote to see if that really fixes things. The results?

Missing final quote

with final quote added

inject to sendmail

ht” in final email to LN; extension chopped

htm” and all is good

inject to LN

htm in final email; but extension not chopped

htm” and all is good

I now had incontrovertible proof that sendmail, my sendmail was altering the original message. It is looking at the unbalanced quote mark situation and recovering as best as possible by replacing the terminating character “m” with the missing double quote “. I was beginning to suspect it. After that shock drained away, I tried to check the RFCs. I figured it must be some well-meaning attempt on its part to make things right. Well, the RFCs, 822 and 1806 are a little hard to read and apply to this situation.

Let’s be clear. There’s no question that the sender is wrong and ought to be closing out that quote. But I don’t think there’s some single, unambiguous statement from the RFCs that make that abundantly apparent. Nevertheless, of course that’s what I told them to do.

The other thing from reading the RFC is that the whole filename attribute looks optional. To satisfy my curiosity – and possibly provide more options for remediation to aspen – I sent a test where I entirely left out the offending filename=”tower… line. In that case the line above it should have its terminating semicolon shorn:

Content-Disposition: attachment

After all, there already is a name=”tower…” as a Content-type parameter, and the string following that was never in question: it has its terminating semicolon.

Yup, that worked just great too!

Then I thought of another approach. Shouldn’t the overriding definition of the what the filetype is be contained in the Content-type header? What if it were more correctly defined as

Content-type: text/html

?

Content-type appears in two places in this email. I changed them both for good measure, but left the unbalanced quotations problem. Nope. Lotus Notes did not know what to with the attachment it displays as tower status_2012_06_04–09.25.00.ht. So we can’t recommend that course of action.

What Sendmail’s Point-of-View might be
Looking at the book, I see sendmail does care about MIME headers, in particular it cares about the Content-Disposition header. It feels that it is unreliable and hence merely advisory in nature. Also, some years ago there was a sendmail vulnerability wherein malformed multipart MIME messages could cause sendmail to crash (see http://www.kb.cert.org/vuls/id/146718. So maybe sendmail is just a little sensitive to this situation and feels perfectly comfortable and justified in right-forming a malformed header. Just a guess on my part.

Case closed.

Conclusion
We battled a strange email attachment naming error which seemed to be an RFC violation of the MIME protocols. By carefully constructing a testing program we were easily able to reproduce the problem and isolate the fault and recommend corrective actions in the sending program. Now we have a convenient way to inject SMTP email whenever and wherever we want. We feel sendmail’s reputation remains unscathed, though its corrective actions could be characterized as overly solicitous.

Categories
Internet Mail

How to run sendmail in queue-only mode

Intro
I guess I’ve ragged on sendmail before. Incredibly powerful program. Finding out how to do that simple thing you want to do may not be so easy, even with the bible at your side. So to that end I’m making an effort to document those simple things which I’ve found I’ve struggled with.

The Details
Today I wanted to capture all email coming into my sendmail daemon. Well, actually it’s a little more complicated. I didn’t want to disturb production email, but I wanted to capture a spam sample. Today there was a hugely effective spam campaign purporting to be email from the Better Business Bureau (BBB). All the emails however actually came from various senders @aicpa.org. Postini put a filter in place but I knew more were getting through. But they weren’t coming to me. How to get capture them without disturbing users?

In this post I gave some obscure but useful tips for sendmail admins, including the ever-useful smarttable add-on. To reprise, smarttable allows you to make delivery decisions based on sender! That’s totally antithetical to your run-of-the-mill sendmail admin, but it’s really useful… Like now. So I quickly put up a sendmail instance, copying a working config I use in production. But I changed the listener to IP address 127.0.0.2 (which I fortunately had already set up for some other reason I can no longer recall). That one’s pretty standard. That’s just:

DAEMON_OPTIONS(`Name=sm-cap, Addr=127.0.0.2')dnl

Of course you want to create a new queue directory just for the captured emails. I created /mqueue/c0 and put in this line into my .mc file:

define(QUEUE_DIR, `/mqueue/c*')dnl

And here’s the main point, how to defer delivery of all emails. Sendmail actually distinguishes between defer and queueonly. I chose queueonly thusly:

define(`confDELIVERY_MODE',`queueonly')dnl

If by chance you happen to misspell DELIVERY_MODE, like, let’s say, DELIERY_MODE, you don’t seem to get a whole lot of errors. Not that that would ever happen to us, mind you, I’m just saying. That’s why it’s good to also know about the command-line option. Keep reading for that.

It’s simple enough to test once you have it running (which I do with this line: sudo sendmail -bd -q -C/etc/mail/capture.cf).

> telnet 127.0.0.2 25
Trying 127.0.0.2…
Connected to 127.0.0.2.
Escape character is ‘^]’.
220 drj.com ESMTP server ready at Fri, 24 Feb 2012 15:16:40 -0500
helo localhost
250 drjemgw2.drj.com Hello [127.0.0.2], pleased to meet you
mail from: [email protected]
250 2.1.0 [email protected]… Sender ok
rcpt to: [email protected]
250 2.1.5 [email protected]… Recipient ok
data
354 Enter mail, end with “.” on a line by itself
subject: test of the capture-only sendmail instance

Just a test!
-Dr J
.

250 2.0.0 q1OKGet2008636 Message accepted for delivery
quit
221 2.0.0 drj.com closing connection
Connection closed by foreign host.

Is the message there, queued up the way we’d like? You bet:

> ls -l /mqueue/c0

total 16
-rw------- 1 root root  19 2012-02-24 15:17 dfq1OKGet2008636
-rw------- 1 root root 542 2012-02-24 15:17 qfq1OKGet2008636

There also seems to be a second way to run sendmail in queue-only fashion. I got it to work from the command-line like this:

> sudo sendmail -odqueueonly -bd -C/etc/mail/capture.cf

The book says this is deprecrated usage, however. But let’s see, that’s O’Reilly’s Sendmail 3rd edition, published in 2003, we’re in 2012, so, hmm, they still haven’t cut us off…

One last thing, that smarttable entry for my main sendmail daemon. I added the line:

@aicpa.org relay:[127.0.0.2]

Conclusion
It can be useful to queue all incoming emails for various reasons. It’s a little hard to find out how to do this precisely. We found a way to do this without stopping/starting our main sendmail process. This post shows a couple ways to do it, and why you might need to.

May 2012 Update
Just wanted to mention about BBB email how I handle it now. They told me they maintain an accurate SPF record. Sure enough, they do. Now we only accept bbb.org email when the SPF record is a match. But I don’t use sendmail for that, I use Postini’s (OK, Google’s, technically) mail hygiene service. Postini rocks!

My most recent post on how to tame the confounding sendmail log is here.

Categories
Admin Internet Mail Linux SLES

Building sendmail on SLES

Intro
My sendmail binary built for SLES 10 SP 3 was not working well at all on SLES 11 SP1. It became apparent that libraries were not compatible so it was time to re-compile. I’ve documented that journey here. There were a few pitfalls along the way so I felt it was worth a blog post should anyone else ever need to do this.

February 2013 update
And now I’ve repeated the journey on SLES 11 SP2 – and ran into new problems! I’ll put that story in the appendix below.

Why Build?
Why build sendmail when you can find a package for it? For security it’s a good idea to run the latest version. It’s easier to defend during an audit. So when I look via zypper, I see it proposes me sendmail v 8.14.3:

# sudo zypper if sendmail
Refreshing service 'nu_novell_com'.
Loading repository data...
Reading installed packages...
 
Information for package sendmail:
 
Repository: SLES11-SP1-Pool
Name: sendmail
Version: 8.14.3-50.20.1
Arch: x86_64
Vendor: SUSE LINUX Products GmbH, Nuernberg, Germany
...

I go to sendmail.org and find that the latest version is actually 8.14.5. And that’s fairly typical. The distributed release is over a year old.

Where this can really matter is when a vulnerability comes out. If you can roll your own you can be the first on the block with that vulnerability fixed – not putting yourself at the mercy of a vendor busy with hundreds of other distractions. And I have seen this phenomenon in action.

Distributing Your Build
I’m mixing up the order here.

Once you have your sendmail built, what’s the minimum set of things you need to put it on other servers?

For one, you gotta have a database package with sendmail. For historical reasons I use sleepycat (I think formerly known as Berkeley db). Only it was gobbled up by Oracle. I don’t have a feel for what the future holds, though I fear decline. Sleepycat provides db. I grabbed the RPM from rpmfind.net:

db-4.7.25-1rt.x86_64.rpm.

This package has to go on each server where you will run sendmail if you are using db for your database. First issue in trying to install this RPM:

# sudo rpm -i db-4.7.25-1rt.x86_64.rpm
        file /usr/bin/db_archive from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_checkpoint from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_deadlock from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_dump from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_hotbackup from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_load from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_printlog from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_recover from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_stat from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_upgrade from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_verify from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64

I don’t really know if any of these conflicting files are used by the system. I also don’t know how to install “all but the conflicting files” in RPM. So we’ll try our luck and simply overwrite them:

# sudo rpm -i –force db-4.7.25-1rt.x86_64.rpm

Now no errors are reported.

We gotta make another kludge. sendmail, or at least the version I compiled, needs this db library in /usr/lib64, but the RPM puts it in /usr/lib. So…

# cd /usr/lib64; sudo ln -s ../lib/libdb-4.7.so

By the way, how did I decide that db has to be brought over? I fired up sendmail and got this error:

# sendmail
sendmail: error while loading shared libraries: libdb-4.7.so: cannot open shared object file: Error 40

Now with the sym link made I get a more pleasant application error:

# sendmail
can not chdir(/var/spool/clientmqueue/): Permission denied
Program mode requires special privileges, e.g., root or TrustedUser.

Continuing my out-of-order documentation(!), I create an init script in /etc/init.d and make sure sendmail is going to be started at boot:

# sudo chkconfig -s drjohnssendmail 35

I like to put the logs in /maillog:

# sudo rm /var/log/mail; sudo mkdir !$; sudo ln -s !$ /maillog

I like to have the logging customized a bit, so I modify syslog-ng.conf somewhat:

# DrJ attempt to define filter based on match of sm-mta
filter f_sm-mta     { match("sm-mta"); };
filter f_fs-mta     { match("fs-mta"); };

and

...
#destination mail { file("/var/log/mail"); };
#log { source(src); filter(f_mail); destination(mail); };
 
destination mailwarn { file("/var/log/mail/mail.warn" perm(0644)); };
log { source(src); filter(f_mailwarn); destination(mailwarn); };
 
destination mailerr  { file("/var/log/mail/mail.err" perm(0644) fsync(yes)); };
log { source(src); filter(f_mailerr);  destination(mailerr); };
 
#
# and also all in one file:
#
#
# and also all in one file:
#
destination mail { file("/var/log/mail/stat.log" perm(0644)); };
log { source(src); filter(f_sm-mta); destination(mail); };
log { source(src); filter(f_fs-mta); destination(mail); };

Followed by a

# sudo service syslog restart

Going Back to Compiling
So how did we compile sendmail in the first place, which was supposed to be the subject of this blog?

We downloaded the latest version from sendmail.org and unpacked it. Then we read the INSTALL file in the sendmail-8.14.5 directory for general guidance about the steps.

To make our ocmpilation configuration portable, we try to encapsulate our idiosyncracies in one file, devtools/Site/site.config.m4, which we create. Mine looks as follows:

dnl  DrJ config file for corporate mail delivery
dnl I am leaving out the ldap stuff because we stopped using it
dnl 
dnl which maps we will support - NEWDB is automatic if it finds the db libs
dnl in Linux NDBM is really GDBM, which isn't supported.  NEWDB support is not automatic
define(`confMAPDEF',`-DNEWDB -DMAP_REGEX')
dnl Berkeley DB is here...
dnl this doesn't work, exactly - put the db libs directly into /usr/lib
APPENDDEF(`confLIBDIRS',`-L/usr/local/ssl/lib -L/usr/lib64')
dnl libdb-4 looks like the sleepycat library on Linux
APPENDDEF(`confLIBS',`-ldb-4')
APPENDDEF(`confINCDIRS',`-I/opt/local/include -I/usr/local/ssl/include')
dnl where to put smrsh and mail.local programs
define(`confEBINDIR', `/opt/mail/bin')
dnl where to install include files
define(`confINCLUDEDIR', `/opt/mail/include')
dnl where to install library files
define(`confLIBDIR', `/opt/mail/lib')
dnl man pages
define(`confMANROOT', `/opt/local/man/cat')
dnl unformatted man pages
define(`confMANROOTMAN', `/opt/local/man/man')
dnl the sendmail binary goes into MBINDIR
define(`confMBINDIR', `/opt/mail/bin')
dnl programs only executed by root go to sbin
define(`confSBINDIR', `/opt/mail/sbin')
dnl shared library directory
define(`confSHAREDLIBDIR', `/opt/mail/lib')
dnl user-executable pgms, newaliases, mailq, hoststat, etc
define(`confUBINDIR', `/opt/mail/bin')
dnl TLS support 
APPENDDEF(`conf_sendmail_ENVDEF', `-DSTARTTLS')
APPENDDEF(`conf_sendmail_LIBS', `-lssl -lcrypto')

I’ll explain a bit of this file since of course it is critical to what we are trying to do. I arrived at its current state from a little experimentation, so I don’t know the full explanation of some of the settings. What it’s saying is that we use the NEWDB, which uses that db we spoke of earlier for our maps. I like to install the binaries into /opt/mail/bin. We like to have the option to run TLS.

With that set we run the compile:

# cd sendmail; sh ./Build

and it spits out some errors to the screen which indicate we’re missing some SSL headers. We get them with:

# sudo zypper source-install openssl

Now with any luck it will fully compile.

At this point it helps to create some of the target directories:

# sudo mkdir -p /opt/mail/{bin,sbin} /opt/local/man/cat{1,5,8}

And we create a user account, smmsp, with uid and gid of 225, and a group with the same name.

And then we can install it:

# sudo sh ./Build install

The install should work, mostly. But makemap doesn’t get put in /opt/mail for some reason. So you have to copy it by hand from sendmail-8.14.5/obj.Linux.2.6.32.12-0.7-default.x86_64/makemap to /opt/mail/sbin, for instance. You really need to have a makemap.

Finally, I suggest to recursively copy the cf directory:

# cp -r sendmail-8.14.5/cf /opt/mail

This way you have a pretty relocatable set of files under /opt/mail.

Appendix: Building on SLES 11 SP2
I thought this would be a breeze. Just look at my own blog posting above! Not so fast – that approach didn’t work at all.

You have to appreciate that under SLES 11 SP1 we needed a few key packages that aren’t very common:

– libopenssl-devel-0.9.8h-30.27.11
– zlib-devel-1.2.3-106.34

We pulled them from the SDK DVD. Well, turns out there is no SDK DVD for SLES 11 SP2! Novell, probably in one of those beloved cost-saving measures, no longer releases an SDK DVD. What to do?

Well, I found what not to do. I began copying key files from my SLES 11 SP1 installation, like libcrypto.a, libssl.a, /usr/include/ssl. This all helped to reduce the number of errors. But at the end of the day there was an error I couldn’t chase away no matter what:

/usr/src/packages/BUILD/openssl-0.9.8h/crypto/comp/c_zlib.c:235: undefined reference to `inflate'

Meanwhile the resourceful sysadmin found those development packages. He told me about SUSE Studio, which allows you to build your own distribution. He looked for those development packages in the distribution, found and installed them:

$ rpm -qa|grep devel

libopenssl-devel-0.9.8j-0.26.1
zlib-devel-1.2.3-106.34
zlib-devel-32bit-1.2.3-106.34

Then my build went through fine. Whew!

Conclusion
I could go on and on about details in the setup, but the scope here is the compilation, and we’ve covered that pretty well.

Once you get over the pain of compilation setup, sendmail runs great and is a great MTA.