Admin Network Technologies

Postfix Operational tips

I’m trying out the system-supplied postfix on a SLES system. i had been using sendmail but there doesn’t seem to be any development on that software.

Some commands I needed right away
Well, right away I had thousands of queued messages so I needed a way to make sense of what was happening.

For these commands to make sense you need to know that I am running a second postfix configuraiton out of /etc/postfixEXT.

Display the queue

postqueue -c /etc/postfixEXT -p

Force delivery from the queue

postqueue -c /etc/postfixEXT -f

List one email in detail

postcat -vq -c /etc/postfixEXT QUEUEID

Delete one email

postsuper -c /etc/postfixEXT -d QUEUEID

Put mail on hold

postsuper -c /etc/postfixEXT -h ALL|QUEUEID

Release mail form hold

postsuper -c /etc/postfixEXT -H ALL|QUEUEID

How to force delivery of a single message
This command is not documented anywhere – because it doesn’t exist so you have to get creative. If you have the luxury of halting all email for a few seconds simply do this:

Display the queue to find the queue ID of the email you want to force delivery of

postqueue -c /etc/postfixEXT -p

Put all mail on hold

postsuper -c /etc/postfixEXT -h ALL

Now release the hold on that one email

postsuper -c /etc/postfixEXT -H QUEUEID

QUEUEID is, of course, the queue id .e.g., F2A1A27891E, of the email in question.

Look for what happened
Check your mail log’s last lines in /var/log/mail

Revert back to normal running

postsuper -c /etc/postfixEXT -H ALL

Since mail is store-and-forward and not real time, you can do these steps, quickly, even on a production system and no one will be the wiser if you are pretty quick. Probably takes two minutes even for a slow typer.

How to run multiple listeners
I didn’t want to disturb the system-installed postfix too much. I would let it “have” the loopback address,, leaving me the other IPs for my relay config to listen on. I added these lines to /etc/postfix/

multi_instance_enable = yes
multi_instance_directories = /etc/postfixEXT

service postfix start starts up the local postfix plus my relay. Grep the process table for either master or postfix to see. However, to be honest, service postfix stop does not kill all processes. So I always end up killing one of the master processes by hand. Update: postmulti -p stop does the trick to kill all. There is also a status or start option instead of stop.

Sendmail to Postfix migration tips
This could be a separate post but I am too lazy to do that.

What happens to the access file? I kept the name of the file access but just list all the IPs, one per line, without any further arguments, to permit just those IPs relay access. In my I have a line like this to tie it together:

mynetworks = /etc/postfixEXT/access

Note that there is no hashed or .db version of this file any longer, unlike in the sendmail case.

References and related
Since I mentioned sendmail I have to give a shout out to one of my old sendmail posts.

More info on postfix multiple instances. A pretty complete guide.

Admin Internet Mail

Analysis of a spam campaign and how we managed to fight back for a few days

A long-running spam campaign has been bothering me lately. In this post I analyze it from a sendmail perspective and provide a simple script I wrote which helped me fight back.

The details
Let’s have a look see at the July 3rd variant of this spam. Although somewhat different from the previous campaigns in that this did not provide users with a carefully phished email to their inbox, from a sendmail perspective it had a lot of the same features.

So the July 3rd spam was a spoof of Marriott. Look at these from lines. They pretty much shout the pattern out:

Jul  3 14:12:20 drjemgw sm-mta[4707]: r63IA8dJ004707:, size=0, class=0, nrcpts=1, proto=SMTP, daemon=sm-m
ta, []
Jul  3 14:12:22 drjemgw sm-mta[7088]: r63ICDA7007088:, size=0, class=0, nrcpts=1, proto=SMTP, daemon=sm-mta, r []
Jul  3 14:12:23 drjemgw sm-mta[7220]: r63ICIhL007220:, size=0, class=0, nrcpts=1, proto=SMTP, daemon=sm-m
ta, []
Jul  3 14:12:24 drjemgw sm-mta[7119]: r63ICEp6007119:, size=0, class=0, nrcpts=1, proto=SMTP, daemon=sm
-mta, []
Jul  3 14:12:33 drjemgw sm-mta[7346]: r63ICO8H007346:, size=0, class=0, nrcpts=1, proto=SMTP, daemon=sm-mta, []
Jul  3 14:12:34 drjemgw sm-mta[7425]: r63ICTsI007425:, size=0, class=0, nrcpts=1, proto=SMTP, daemon=sm-
mta, []
Jul  3 14:12:35 drjemgw sm-mta[7387]: r63ICRMP007387:, size=0, class=0, nrcpts=1, proto=SMTP, daemon=s
m-mta, []
Jul  3 14:12:39 drjemgw sm-mta[1757]: r63I7dfa001757:, size=0, class=0, nrcpts=1, proto=SMTP, daemon=sm-mta, re []
Jul  3 14:12:40 drjemgw sm-mta[6643]: r63IBpYm006643:, size=0, class=0, nrcpts=1, proto=SMTP, daemon=sm-mta, relay=e []
Jul  3 14:12:42 drjemgw sm-mta[4894]: r63IAFug004894:, size=0, class=0, nrcpts=1, proto=SMTP, daemon=sm-mta, []
Jul  3 14:12:43 drjemgw sm-mta[7573]: r63ICZJq007573:, size=0, class=0, nrcpts=1, proto=SMTP, daemon=sm-mt
a, []
Jul  3 14:12:45 drjemgw sm-mta[7698]: r63ICfP9007698:, size=0, class=0, nrcpts=1, proto=SMTP, daemon
=sm-mta, []
Jul  3 14:12:46 drjemgw sm-mta[7610]: r63ICblx007610:, size=0, class=0, nrcpts=1, proto=SMTP, daemon=sm-
mta, []
Jul  3 14:12:50 drjemgw sm-mta[7792]: r63ICl6Y007792:, size=0, class=0, nrcpts=1, proto=SMTP, daemon=sm-mta, rela []
Jul  3 14:12:51 drjemgw sm-mta[6072]: r63IBGCU006072:, size=0, class=0, nrcpts=1, proto=SMTP, d
aemon=sm-mta, []
Jul  3 14:12:51 drjemgw sm-mta[7549]: r63ICYnm007549:, size=0, class=0, nrcpts=1, proto=SMTP, daemon=sm-mta, rela []
Jul  3 14:12:55 drjemgw sm-mta[7882]: r63ICrUW007882:, size=0, class=0, nrcpts=1, proto=SMTP, daemon=sm-mta, r []
Jul  3 14:12:57 drjemgw sm-mta[7925]: r63ICtav007925:, size=0, class=0, nrcpts=1, proto=SMTP, daemon=sm-mta, relay=e []
Jul  3 14:12:57 drjemgw sm-mta[7930]: r63ICu5c007930:, size=0, class=0, nrcpts=1, proto=SMTP, daemon=sm-
mta, []
Jul  3 14:12:58 drjemgw sm-mta[7900]: r63ICsOE007900:, size=0, class=0, nrcpts=1, proto=SMTP, daemon=sm-mta, relay=eu1 []
Jul  3 14:13:00 drjemgw sm-mta[7976]: r63ICwmu007976:, size=0, class=0, nrcpts=1, proto=SMTP, daemon=sm-mta, relay= []

Of 2035 of these that were sent out, 192 were delivered, meaning got into the users inbox past Postini’s anti-spam defenses. So that’s a pretty high success rate as spam goes. And users get concerned.

Now look at sendmail’s access file which I created shortly after becoming aware of similar phishing of more recently on July 11th:


You get the idea.

What I noticed in these campaigns is a wide variety of subdomains of the domain being phished, with and without “mail” attached to the domain. In particular some rather peculiar-looking subdomains such as complains and emalsrv. So I realized that instead of waiting for me to get the spam, I can constantly comb the log file for these peculiar subdomains. If I come across a new one, voila, it means a new spam campaign has just started! And I can send myself an alert so I can decide – by hand – how best to treat it, knowing it will generally follow the pattern of the recent campaigns.

Now here’s the script I wrote to catch this type of pattern early on:

# DrJ, 7/2013
# I keep my sendmail log file here in /maillog/stat.log and cut it daily
$sl = "/maillog/stat.log";
# 10000 lines occurs in about eight minutes
$DEBUG = 0;
$i = 0;
$lastlines = "-10000";
$access = "/etc/mail/access";
open(ACCESS,$access) || die "Cannot open $access!!\n";
@access = <ACCESS>;
open(SL,"/usr/bin/tail $lastlines $sl|") || die "cannot run tail $lastlines on $sl!!";
print "anti-spam domain: ";
  ($domain) = /from=\w{1,25}@(?:emalsrv|complains)\.([^\.]+)\./;
  if ($domain) {
# test if we already have it on our access table
    $seenit = 0;
    foreach $line (@access) {
# lazy, inaccurate match, but good enough...
      $seenit = 1 if $line =~ /$domain/;
      print "seenit, domain, line: $seenit, $domain, $line\n" if $DEBUG;
    if (! $seenit) {
      print "$domain\n" if $i == 1;

I call the script I invoke every couple minutes from HP SiteScope. There I have alerts set up which email me a brief message that includes the new domain that is being phished.

No sooner had I implemented this script than it went off and told me about that linkedin phishing spam campaign! That was sweet.

Recent campaigns
Here is a chronology of spam campaigns which follow the pattern documented above. They seem to cook them up one per day.

5/16 - their misspelling, not mine!
date uncertain
7/17 - this one changed up the pattern a bit
7/18 - again - with somewhat new pattern
a bit more, a smattering of and
tapering off...
7/30 and later
spammer seems to have gone on hiatus, or finally been arrested
10/2, they're back

One example spam
Here was my phishing spam from 3M which I got yesterday:

From: "3M" <>
To: DrJ
This is an automated e-mail.
This account is not reviewed for responses.
This email is to confirm that on 07/17/2013, 3M's bank (JP Morgan) has debited $15,956.64 from your bank account.
If you have any questions, please visit the 3M EIPP Helpline at this link.

The HTML source for that last line looks like this:

If you have any questions, please visit the 3M EIPP=
		 		 Helpline <a href=3D"
ml?help">at this link.</div>

When I checked Bluecoat’s K9 webfilter, which I even use at home, the URL in the link, vlayaway… was not rated. I submitted a suggest category, Malicious Sources, and they efficiently assigned it that category within minutes of my submission.

Also, note that the envelope sender of my email differs from the Sender header. The envelope sender was

A word about DISCARD vs ERROR
While I’m waiting for more spam of this sort to come in as I write this on July 22nd, I had a brainstorm. Rather than DISCARDind these emails, which doesn’t tip the sender off, it’s probably better to send a 550 error code, which rattles the system a bit more. I think a sending IP with too many of these errors will be temporarily banned by Postini for all their users. So I changed all my DISCARDs. Here is the syntax for one example line: ERROR:"550 Sender banned. Please use legitimate domain to send email."

I originally wanted to put the message “No such user,” to try to get the spammer to take that specific recipient off their spam list, but it doesn’t really work in the right way: the error is reported in the context of the sender address, not the recipient address.

Here is the protocol which shows what I am talking about:

$ telnet 25
Connected to
Escape character is '^]'.
220 Postini ESMTP 133 y678_pstn_c6 ready.  CA Business and Professions Code Section 17538.45 forbids use of this system for unsolicited electronic mail advertisements.
helo localhost
250 Postini says hello back
mail from:
250 Ok
rcpt to:
550 5.0.0 Sender banned. Please use legitimate domain to send email. - on relay of: mail from:
221 Catch you later

So that – on relay of: mail from: … is added by Postini so it really doesn’t make sense to say No such user in that context.

My satisfaction may be short-lived. But it is always sweet to be on top, even for a short while.

For a lighthearted discussion of HP SiteScope, read the comments from this post.

Sendmail is discussed in various posts of mine. For instance, Analyzing the sendmail log, and Obscure tips for sendmail admins.

Internet Mail

Analyzing the sendmail log

If you’ve read any of my posts you’ll see that I believe sendmail is a technical marvel, but that’ snot to say it’s without its flaws.

One of the annoying things is that the From line and To line are recorded spearately, in defiance of common logic. I present a simple program I wrote to join these lines.

The details

Without further ado, here is the program, which I call

# combine lines from stat.log
# Copyright work under the Artistic License,
# DrJ 6/2013 - make more readable based on this format:
# Date|Time|Size[Bytes]|Direction|Sender|Recipient|Relay-MTA
# From= usually has address surrounded by <>, but not always
# input of
#Jun 20 10:00:21 drjemgw sm-mta[24318]: r5KE0K1U024318:, size=5331, class=0, nrcpts=1, msgid=<15936355.7941275772268.JavaMail.easypaynet@Z32C1GBSZ4AN2>, proto=SMTP, daemon=sm-mta, []
#Jun 20 10:00:22 drjemgw sm-mta[24367]: r5KE0K1U024318: to=<>, delay=00:00:02, xdelay=00:00:01, mailer=esmtp, pri=125331, [], dsn=2.0.0, stat=Sent (r5KE0M6E027784 Message accepted for delivery)
# produces
use Getopt::Std;
getopts('s:f'); # -s search_string -f for "full" version of output
$DEBUG = 0;
print "$relay{$ID}, $lines{$ID}, $sender{$ID}, $size{$ID}\n";
$year = `date +%Y`;
while(<>) {
  print $_ ."\n" if $DEBUG;
# get ID
  ($ID) = /\[\d{2,10}\]:\s+(\w+):\s+/;
#print "ID: $ID\n";
  if ($lines{"$ID"} && / stat=Sent /) {
    if ($opt_f) {
      $lines{"$ID"} .= '**to**line**'.$_;
    } else {
      ($recip,$relay) = /:\sto=<(.+)>,\s.*\srelay=(\S+)\s/;
# there can be multiple recipients listed
      $recip =~ s/[\<\>]//g;
# disposition of email.  This needs customization for your situation, but it only determines IN vs OUT vs INTERNAL so it's not critical...
# In this example coding we get all our inbound email from, and outbound mail comes from drjinternal
      if ($relay{$ID} =~ /postini\.com/) {
        $disp = "IN";
      } else {
        $disp = $relay =~ /drjinternal/ ? "INTERNAL" : "OUT";
      $lines = "$lines{$ID}|$size{$ID}|$disp|$sender{$ID}|$recip|$relay{$ID}";
      if ( ($lines =~ /$opt_s/ || ! $opt_s) && ($sender{$ID} || $recip) ) {
        $lines .= "|$ID" if $DEBUG;
#        push @lines, $lines; # why bother?  just spit it out immediately
         print "$lines\n";
# save memory, hopefully? - can't do this. sometimes we have multiple To lines
#      undef $relay{$ID}, $lines{$ID}, $sender{$ID}, $size{$ID};
      print "$recip\n" if $DEBUG;
  } else {
    if ($opt_f) {
      $lines{"$ID"} .= '**from**line**'.$_;
    } else {
      ($mon,$date,$time,$sender,$size,$relay) = /^(\w+)\s+(\d+)\s+([\d:]+)\s.+\sfrom=<?([^<][^,]*\w)>?,\ssize=(\d+).*relay=(\S+)/;
# convert month name to month number
      $monno = index('JanFebMarAprMayJunJulAugSepOctNovDec',$mon)/3 + 1;
# the year is faked - it's not actually recorded in the logs so we assume it's the current year...
      $lines{$ID} = "$date.$monno.$year|$time";
      $size{$ID} = $size;
      $sender{$ID} = $sender;
      $relay{$ID} = $relay;
      print "$mon,$date,$time,$sender,$size,$relay\n" if $DEBUG;
# now start matching
if ($opt_f) {
  foreach (@lines) {
    print $_."\n"

What it does is combine the From and To lines based on the message ID which is unique to a message.

I usually use it to suck in an entire day’s log (I call my sendmail log stat.log) and grep the output to look for a particular string. For instance today there was a spam blast where ADP’s identity was phished. The sending domains all contained some variant of adp:,,,, etc. So I wanted to find the answer to the question who’s received any of these ADP phishing emails today? Here’s how you use the program, to do that:


The input lines look like this:

Jun 20 10:00:21 drjemgw sm-mta[24318]: r5KE0K1U024318:, size=5331, class=0, nrcpts=1, msgid=<15936355.7941275772268.JavaMail.easypaynet@Z32C1GBSZ4AN2>, proto=SMTP, daemon=sm-mta, []
Jun 20 10:00:22 drjemgw sm-mta[24367]: r5KE0K1U024318: to=<>, delay=00:00:02, xdelay=00:00:01, mailer=esmtp, pri=125331, [], dsn=2.0.0, stat=Sent (r5KE0M6E027784 Message accepted for delivery)

The output from looks like this:


Yeah, I got that ADP spam by the way…

A useful Perl script has been presented which helps mail admins combine separate output lines into a single entry, preserving the most important meta-data from the mail.

Other interesting sendmail posts are also available here and here.

Admin Internet Mail SLES

The IT Detecive Agency: emails began piling up this week, no obvious cause

Today I had my choice of problems I could highlight, but I like this one the best. Our mail server delivers email to a wide variety of recipients. All was going well and it ran pretty much unattended until this week when it didn’t go so well. Most emails were getting delivered, but more and more were starting to pile up in the queues. This is the story of how we unraveled the mystery.

The details
It’s best to work from examples I think. I noticed emails to were being refused delivery as well as emails to The latter is a smaller company so we heard from them the usual story that we’re the only ones who can’t send to them.

So I forced delivery with verbose logging. I’m running sendmail, so that looks like this:

> sendmail -Cconfig_file -v

That didn’t work out, producing a no route to host type of error. I did a DNS lookup by hand. That showed one set of results, while sendmail was connecting to an entirely different IP address. How could that be??

I was at a loss so I do what I do when I’m desperate: strace. That looks like this:

> strace -f sendmail -Cconfig_file -v > /tmp/strace 2>&1

That produced 12,000 lines of output. All the system calls that the process and any of its forked processes invoke. Is that too much to comb through by hand? No, not at all, not when you begin to see the patterns.

I pored over the trace, not knowing what most of it meant, but looking for especially any activity regarding networking and DNS. Around line 6,000 I found it. There was mention of nscd.

For the unaware the use of nscd (nameserver caching daemon) might seem innocent enough, or even good-intentioned. What could be wrong with caching frequently used DNS results? The only issue is that it doesn’t work right! nscd derives from UC Berkeley Unix code and has never been supported. I didn’t even like it when I was running SunOS. It caches the DNS queries but ignores TTLs. This is fatal for mail servers or just about anything you can think of, especially on servers that are infrequently booted as mine are.

I stopped nscd right away:

> service nscd stop

and re-ran the sendmail queue runner (same command as above). The emails flowed out instantly! Soon hundreds of stuck emails were flushed out.

Of course for good measure nscd had to be removed from the startup sequence:

> chkconfig nscd off

An IT pro always keeps unsolved mysteries in his mind. This time I knew I also had in hand the solution an earlier-documented mystery about email to

nscd might show up in your SLES or OpenSuse server. I strongly suggest to disable it before you wind up with old DNS values and an extremely hard-to-debug issue.

Case closed!

Admin DNS Internet Mail SLES

Strange problem with email to

This is probably the most obscure of all postings I will ever do – it’s really just opening up my private journal to the Internet, which helps me when I need to recall how I fixed something.

So the story is that I’m having trouble sending email to anyone in the domain, and I just couldn’t figure out why.

The details
With my sendmail config I finally rolled up my sleeves, and did some debugging, even though I am pressed for time. Start up our sendmail debugging session:

> sendmail -bt -d35.9

This produces a lot of blah, blah, configuration settings, blah, blah, and finally a sort of sendmail debugging shell. So let’s test a good “normal” domain:

> 3,0

canonify           input: test @ gmail . com
Canonify2          input: test < @ gmail . com >
Canonify2        returns: test < @ gmail . com . >
canonify         returns: test < @ gmail . com . >
parse              input: test < @ gmail . com . >
Parse0             input: test < @ gmail . com . >
Parse0           returns: test < @ gmail . com . >
ParseLocal         input: test < @ gmail . com . >
ParseLocal       returns: test < @ gmail . com . >
Parse1             input: test < @ gmail . com . >
Mailertable        input: < gmail . com > test < @ gmail . com . >
Mailertable        input: gmail . < com > test < @ gmail . com . >
Mailertable      returns: test < @ gmail . com . >
Mailertable      returns: test < @ gmail . com . >
SmartTable         input: test < @ gmail . com . >
SmartTable       returns: test < @ gmail . com . >
MailerToTriple     input: < > test < @ gmail . com . >
MailerToTriple   returns: test < @ gmail . com . >
Parse1           returns: $# esmtp $@ gmail . com . $: test < @ gmail . com . >
parse            returns: $# esmtp $@ gmail . com . $: test < @ gmail . com . >

and then this problem domain:

> 3,0

canonify           input: test @ paladinny . com
Canonify2          input: test < @ paladinny . com >
Canonify2        returns: test < @ paladinny . no-ip . biz . >
canonify         returns: test < @ paladinny . no-ip . biz . >
parse              input: test < @ paladinny . no-ip . biz . >
Parse0             input: test < @ paladinny . no-ip . biz . >
Parse0           returns: test < @ paladinny . no-ip . biz . >
ParseLocal         input: test < @ paladinny . no-ip . biz . >
ParseLocal       returns: test < @ paladinny . no-ip . biz . >
Parse1             input: test < @ paladinny . no-ip . biz . >
Mailertable        input: < paladinny . no-ip . biz > test < @ paladinny . no-ip . biz . >
Mailertable        input: paladinny . < no-ip . biz > test < @ paladinny . no-ip . biz . >
Mailertable        input: paladinny . no-ip . < biz > test < @ paladinny . no-ip . biz . >
Mailertable      returns: test < @ paladinny . no-ip . biz . >
Mailertable      returns: test < @ paladinny . no-ip . biz . >
Mailertable      returns: test < @ paladinny . no-ip . biz . >
SmartTable         input: test < @ paladinny . no-ip . biz . >
SmartTable       returns: test < @ paladinny . no-ip . biz . >
MailerToTriple     input: < > test < @ paladinny . no-ip . biz . >
MailerToTriple   returns: test < @ paladinny . no-ip . biz . >
Parse1           returns: $# esmtp $@ paladinny . no-ip . biz . $: test < @ paladinny . no-ip . biz . >
parse            returns: $# esmtp $@ paladinny . no-ip . biz . $: test < @ paladinny . no-ip . biz . >

I have to look more into what Canonify2 does. But this gives me an idea: force the mailertable to handle paladinny . no-ip . biz the way I want it to, namely:

because in DNS my DNS server returns this funny result:

> dig mx

; <<>> DiG 9.6-ESV-R7-P3 <<>> mx
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17559
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 0
;                 IN      MX
;; ANSWER SECTION:          351     IN      CNAME
;; AUTHORITY SECTION:              60      IN      SOA 2052775595 600 300 604800 600
;; Query time: 30 msec
;; WHEN: Fri Jan 18 08:53:49 2013
;; MSG SIZE  rcvd: 121

whereas Google’s public DNS says this, which looks like the intended result:

> dig mx @

; <<>> DiG 9.6-ESV-R7-P3 <<>> mx @
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3749
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;                 IN      MX
;; ANSWER SECTION:          1800    IN      MX      10
;; Query time: 236 msec
;; WHEN: Fri Jan 18 08:55:42 2013
;; MSG SIZE  rcvd: 71

So at least we know where that odd comes from, sort of. It comes from my nameserver, but where it got that answer from I have no idea. It doesn’t come from the authoritative nameservers:

> dig mx

; <<>> DiG 9.6-ESV-R7-P3 <<>> mx
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45704
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;                 IN      MX
;; ANSWER SECTION:          1800    IN      MX      10
;; Query time: 82 msec
;; WHEN: Fri Jan 18 08:59:50 2013
;; MSG SIZE  rcvd: 71

A CNAME is not an MX record, so why my nameserver is returning an answer (ANSWER: 1)when queried for the MX record when all it thinks it has is a CNAME seems to be an out-and-out error.

And putting the resolved name in the mailertable is also not normal. Normally you put the domain itself, as in:

and of course that’s the first thing I tried, but it has no effect whatsoever.

February Update and Conclusion
The mystery was solved when a whole bunch of email deliveries started failing on my system and I was forced to do some serious debugging. Long story short my SLES system was regrettably running nscd, the nameserver caching daemon. I didn’t even bother to check So many other things cleared up when I killed it I’m sure it was the cause of the issue as well. This is all described in this post.

Admin Internet Mail Linux Network Technologies

The IT Detective Agency: mail server went down with an old-school problem

I got a TXT from my monitoring system last night. I ignored it because I knew that someone was working on the firewall at that time. I’ve learned enough about human nature to know that it is easy to ignore the first alert. So I’ve actually programmed HP SiteScope alerts to send additional ones out after four hours of continuous errors. When I got the second one at 9 PM, I sprang into action knowing this was no false alarm!

The details
Thanks to a bank of still-green monitors I could pretty quickly rule out what wasn’t the matter. Other equipment on that subnet was fine, so the firewall/switch/router was not the issue. Then what the heck was it? And how badly was it impacting mail delivery?

This particular server has two network interfaces. Though the one interface was clearly unresponsive to SMTP, PING or any other protocol, I hadn’t yet investigated the other interface, which was more Internet-facing. I managed to find another Linux server on the outer network and tried to ping the outer interface. Yup. That worked. I tried a login. It took a whole long time to get through the ssh login, but then I got on and the server looked quite normal. I did a quick ifconfig – the inner interface listed up, had the right IP, looked completely normal. I tried some PINGs from it to its gateway and other devices on the inner network. Nothing doing. No PINGs were returned.

I happened to have access to the switch. I thought maybe someone had pulled out its cable. So I even checked the switch port. It showed connected and 1000 mbits, exactly like the other interface. So it was just too improbable that someone pulled out the cable and happened to plug another cable from another server into that same switch port. Not impossible, just highly improbable.

Then I did what all sysadmins do when encountering a funny error – I checked the messages file in /var/log/messages. At first I didn’t notice anything amiss, but upon closer inspection there was one line that was out-of-place from the usual:

Nov  8 16:49:42 drjmailgw kernel: [3018172.820223] do_IRQ: 1.221 No irq handler for vector (irq -1)

Buried amidst the usual biddings of cron was a kernel message with an IRQ complaint. What the? I haven’t worried about IRQ since loading Slackware from diskettes onto my PC in 1994! Could it be? I have multiple ways to test when the interface died – SiteScope monitoring, even the mail log itself (surely its log would look very different pre- and post-problem.) Yup.

That mysterious irq error coincides with when communication through that interface stopped working. Oh, for the record it’s SLES 11 SP1 running on HP server-class hardware.

What about my mail delivery? In a panic, realizing that sendmail would be happy as a clam through such an error, I shut down its service. I was afraid email could be piling up on this server, for hours, and I pride myself in delivering a faultless mail service that delivers in seconds, so that would be a big blow. With sendmail shut down I knew the backup server would handle all the mail seamlessly.

This morning, in the comfort of my office I pursued the answer to that question What was happening to my mail stream during this time? I knew outbound was not an issue (actually the act of writing this down makes me want to confirm that! I don’t like to have falsehoods in writing. Correct, I’ve now checked it and outbound was working.) But it was inbound that really worried me. Sendmail was listening on that interface after all, so I didn’t think of anything obvious that would have stopped inbound from being readily accepted then subsequently sat on.

But such was not the case! True, the sendmail listener was available and listening on that external interface, but, I dropped a hint above. Remember that my ssh login took a long while? That is classic behaviour when a server can’t communicate with its nameservers. It tries to do a reverse lookup on the ssh client’s source IP address. It tried the first nameserver, but it couldn’t communicate with it because it was on the internal network! Then it tried its next nameserver – also a no go for the same reason. I’ve seen the problem so often I wasn’t even worried when the login took a long time – a minute or so. I knew to wait it out and that I was getting in.

But in sendmail I had figured that certain communications should never take a long time. So a long time ago I had lowered some of the default timeouts. My mc file in the upstream server contains these lines:

dnl Do not use RFC1413 identd. p 762  It requires another whole in the F/W
dnl Set more reasonable timeouts for SMTP commands'

I now think that in particular the HELO timeout (TO_HELO) of 60 seconds saved me! The upstream server reported in its mail log:

Timeout waiting for input from drjmailgw during client greeting

So it waited a minute, as drjmailgw tried to do a reverse lookup on its IP, unsuccessfully, before proceeding with the response to HELO, then went on to the secondary server as per the MX record in the mailertable. Whew!

More on that IRQ error
Let’s go back to that IRQ error. I got schooled by someone who knows these things better than I. He says the Intel chipset was limited insofar as there weren’t enough IRQs for all the devices people wanted to use. So engineers devised a way to share IRQs amongst multiple devices. Sort of like virtual IPs on one physical network interface. Anyways, on this server he suspects that something is wrong with the multipath driver which is loaded for the fiber channel host adapter card. In fact he noticed that the network interface flaked out several times previous to this error. Only it came back after some seconds. This is the server where we had a very high CPU when the SAN was being heavily used. The SAN vendor checked things out on their end and, of course, found nothing wrong with the SAN equipment. We actually switched from SAN to tmpfs after that. But we didn’t unload the multipath driver. Perhaps now we will.

Feb 22 Update
We haven’t seen the problem in over three weeks now. See my comments on what actions we took.

Persistence, patience and praeternatural practicality paid off in this perplexing puzzle!

Admin Internet Mail Linux

Yahoo! stopped accepting emails – what we can learn about sendmail from this

Flash Update
I noticed, and had stopped accepting emails today.

I wanted to provide insight into this problem that few others would have, namely, when did the problem first occur?

I looked at my stuck messages in queue. I realized that with sendmail, there is no easy way to answer the question: what is the oldest stuck message for domain XYZ in the queue? So I rigged up something, very crude, but typical for a Unix command line-type of guy.


$ grep yahoo */qf*|grep for|cut -d\; -f2|sort|head

The answer? My first stuck message comes from 12:43:46 EST today (July 31st). As of this writing, 3:53 PM, the problem still exists, which already makes it a quite long outage even if it is fixed in the next few minutes. Just minutes later, around 4 PM, I noticed a lot of the messages being delivered, so the problem seems to have finally cleared up.

The expression above works if you are root and the current directory is, e.g., /mqueue, under which you have queue directories q0, q1, …, q9, etc corresponding to an MC statement:


The grep above might miss a few messages, but if you have lots of them as I do, it doesn’t really matter as it’s purpose is just to convey the general idea of when the problem started. In my case it can be safely assumed that I am continuously sending emails to, so it is not possible to have a window that isn’t covered of more than ten minutes or so during the day.

We helped document a complete three-hour outage of Yahoo mail. Along the way we learned of a deficiency in sendmail’s mailq command – how limited its reporting options are. We compensated for that by rolling our own series of commands to answer the question of what is the oldest mail of this type in our queues.


54 Popular sendmail Features

Thinking of replacing sendmail? Or switching to sendmail? Here are 54 features I find useful in the way I implement sendmail.

In poorly ordered listing, we have:

– minimal acceptable delivery speed: 100 messages/sec
– queue deletion after 3 days (customizable)
– customizable timers on:
– time to wait for initial connection
– time to wait for response to MAIL command
– time to wait for response to QUIT
– time to wait for response to RCPT TO command
– time to wait for response to RSET command
– time to wait for response to other SMTP commands
– ability to turn off identd usage
– customizable greeting
– ability to deliver local mail for error situations such as looping mail, invalid sender + invalid recipient
– ability to detect looping messages and log and remove them
– errors in MIME format
– configurable maximum message size
– configurable maximum number of recipients per message
– configurable minimum queue age before delivery is re-tried
– configurable address operator characters
– ability to set multiple names for this host
– support for alias address transformations
– support for domain aliasing
– configurable load average at which new messages are refused
– configurable load average at which new messages are queued for later delivery
– configurable load average at which SMTP responses are delayed

– ability to run TLS as server and client, and use a CA-issued certificate
– use of fast table lookups to efficiently handle tables with thousands of entries
– configurable mail relaying decisions based on recipient domain
– ability to turn off UUCP routing
– ability to avoid canonification of recipient domain
– ability to re-write sender address
– ability to make mail relaying decisions based on sender address as well as sender domain
– ability to allow only selected domains/IPs/subnets to relay mail
– ability to reject messages to specified recipients/domains with custom message
– ability to silently discard messages to specific recipients/domains
– ability to discard or reject messages from specific senders or sender domains
– ability to set custom error number for rejected email
– support for mass-import and mass alteration of table entries (e.g., to mail routing/access/alias lists)
– ability to restrict mail relaying to all but a positive match list of IP addresses, subnets and FQDNs
– ability to accept unresolvable domains
– ability to run multiple instances, each with independent configuration, with separate IPs, on same appliance
– ability to make mail routing delivery decisions based on recipient domain configurable by MX lookup, set IP address, FQDN with and without MX lookups
– ability to route all else via DNS lookup
– ability to include comments within the configuration
– ability to turn off ESMTP delivery agent to selected domains and act as simple SMTP delivery agent
– ability to send hourly reports
– log available in real time
– log containing at least these fields: sender, recipients, date/time, delay, size, messageId, TLS used flag, sending MTA, relay MTA, reject reason (if applicable)
– ability to analyze logs with RegEx
– ability to archive logs for up to three months
– ability to send test message through itself with customizable subject on periodic basis
– ability to report on queue contents by top sender/recipient/recipient domain
– ability to force delivery retry on selected domain
– ability to set greeting delay for selected IPs and subnets
– ability to run a browser from same IP as used by the MTA

Most, but not all, of these features are in configured in the .mc file. A few are actually reference to external programs I developed. A few rely on the Linux environment that sendmail runs under.

When you sit down and document it, there’s a lot going on in sendmail.

Admin Internet Mail Linux Perl

The IT Detective Agency: last letter of attachment name is missing!

Today we bring you an IT whodunit thriller. A user using Lotus Notes informs his local IT that a process that emails SQL reports to him and a few others has suddenly stopped working correctly. The reports either contain an HTML attachment where the attachment type has been chopped to “ht” instead of “htm,” or an MHTML attachment type which has also been chopped, down to “mh” instead of “mht.” They get emailed from the reporting server to a sendmail mail relay. Now the convenient ability to double-click on the attachment and launch it stopped working as a result of these chopped filenames. What’s going on? Fix it!

Let’s Reproduce the Problem
Fortunately this one was easier than most to reproduce. But first a digression. Let’s have some fun and challenge ourselves with it before we deep dive. What do you think the culprit is? What’s your hypothesis? Drawing on my many years of experience running enterprise-class sendmail servers, and never before having seen this problem despite the hundreds of millions of delivered emails, my best instincts told me to look elsewhere.

The origin server, let’s call it aspen, sends few messages, so I had the luxury to turn on tracing on my sendmail server with a filter limiting the traffic to its IP:

$ tcpdump -i eth0 -s 1540 -w /tmp/aspen.cap host aspen

Using wireshark to analyze asp.cap and following the tcp stream I see this:

Content-Type: multipart/mixed;
This is a multipart message in MIME format
Content-Type: text/plain;
Content-Transfer-Encoding: 7bit
SQLplus automated report
Content-Type: application/octet-stream;
		 name="tower status_2012_06_04--09.25.00.htm"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
		 filename="tower status_2012_06_04--09.25.00.htm
<html><head></head><body><h1>Content goes here...</h1></body>

Result of trace of original email as received by sendmail

But the source as viewed from within Lotus Notes is:

Content-Type: multipart/mixed;
Content-Transfer-Encoding: 7bit
Content-Type: text/plain;
SQLplus automated report
Content-Type: application/octet-stream;
		 name="tower status_2012_06_04--09.25.00.htm"
Content-Disposition: attachment;
Content-Transfer-Encoding: base64

Same email after being trasferred to Lotus Notes

I was in shock.

I fully expected the message source to go through unaltered all the way into Lotus Notes, but it didn’t. The trace taken before sendmail’s actions was not an exact match to the source of the message I received. So either sendmail or Lotus Notes (or both) were altering the source in significant ways.

At the same time, we got a big clue as to what is behind the missing letter in the file extension. To highlight it, compare this line from the trace:

filename=”tower status_2012_06_04–09.25.00.htm

to that same line as it appears in the Lotus Notes source:

filename=”tower status_2012_06_04–

So there is no final close quote (“) in the filename attribute as it comes from the aspen server! That can’t be good.

But it used to work. What do we make of that fact??

I had to dig farther. I was suddenly reminded of the final episode of House where it is apparent that the solving the puzzle of symptoms is the highest aspiration for Doctor House. Maybe I am similarly motivated? Because I was definitely willing to throw the full weight of my resources behind this mystery. At least for the half-day I had to spare on this.

First step was to reproduce the problem myself. For sending an email you would normally use sendmail or mailx or such, but I didn’t trust any of those programs – afraid they would mess with my headers in secret, undocumented ways.

So I wrote my own mail sending program using Perl/Expect. Now I’m not advocating this as a best practice. It’s just that for me, given my skillset and perceived difficulty in finding a proper program to do what I wanted (which I’m sure is out there), this was the path of least resistance, the best and most efficient use of my time. You see, I already had the core of the program written for another purpose, so I knew it wouldn’t be too difficult to finish for this purpose. And I admit I’m not the best at Expect and I’m not the best at Perl. I just know enough to get things done and pretty quickly at that.

OK. Enough apologies. Here’s that code:

# - 6/2012
# Send mail by explicit use of the protocol
$DEBUG = 1;
use Expect;
use Getopt::Std;
$recip = $opt_r;
$sender = $opt_s;
$hostname = $ENV{HOSTNAME};
print "hostname,mailhost,sender,recip: $hostname,$opt_m,$sender,$recip\n" if $DEBUG;
$telnet = "telnet";
@hosts = ($opt_m);
$logf = "/var/tmp/smtpresults.log";
$timeout = 15;
$data = qq(Subject: test of strange MIME error
X-myHeader: my-value
From: $sender
To: $recip
Subject: SQLplus Report - tower status
Date: Mon, 4 Jun 2012 9:25:10 --0400
Importance: Normal
X-Mailer: ATL CSmtp Class
X-MSMail-Priority: Normal
X-Priority: 3 (Normal)
MIME-Version: 1.0
Content-Type: multipart/mixed;
This is a multipart message in MIME format
Content-Type: text/plain;
Content-Transfer-Encoding: 7bit
SQLplus automated report
Content-Type: application/octet-stream;
        name="tower status_2012_06_04--09.25.00.htm"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
        filename="tower status_2012_06_04--09.25.00.htm
<html><head></head><body><h1>Content goes here...</h1></body>
sub myInit {
# This structure is ugly (p.148 in the book) but it's clear how to fill it
@steps = (
        { Expect => "220 ",
          Command => "helo $hostname"},
# Envelope sender
        { Expect => "250 ",
          Command => "mail from: $sender"},
# Envelope recipient
        { Expect => "250 ",
          Command => "rcpt to: $recip"},
# data command
        { Expect => "250 ",
          Command => "data"},
# start mail message
        { Expect => "354 Enter ",
          Command => $data},
# end session nicely
        { Expect => "250 Message accepted ",
          Command => "quit"},
}       # end sub myInit
# Main program
open(LOGF,">$logf") || die "Cannot open log file!!\n";
foreach $host (@hosts) {
# create an Expect object by spawning another process
sub login {
($host) = @_;
#@params = ($host," 25");
$init_command = "$telnet $host 25";
#$Expect::Debug = 3;
my $exp = Expect->spawn("$init_command")
         or die "Cannot spawn $command: $!\n";
# Now run all the other commands
foreach $step (@steps) {
  $expstr = %{$step}->{Expect};
  $cmd = %{$step}->{Command};
#  print "expstr,cmd: $expstr, $cmd\n";
# Logging
  $exp->log_stdout(0);  # disable stdout for each command
  @match_patterns = ($expstr);
  ($matched_pattern_position, $error, $successfully_matching_string, $before_match, $after_match) = $exp->expect($timeout,
  unless ($matched_pattern_position == 1) {
    $err = 1;
  #die "No match: error was: $error\n" unless $matched_pattern_position == 1;
  # We got our match. Proceed.
}       # end loop over all the steps
# hard close
}       # end sub login

Code for

Invoke it:

$ ./ -m sendmail_host -s -r

The nice thing with this program is that I can inject a message into sendmail, but also I can inject it directly into the Lotus Notes smtp gateway, bypassing sendmail, and thereby triangulate the problem. The sendmail and Lotus Notes servers have slightly different responses to the various protocol stages, hence I clipped the Expect strings down to the minimal common set of characters after some experimentation.

This program makes it easy to test several scenarios of interest. Leave the final quote and inject into either sendmail or Lotus Notes (LN). Tack on the final quote to see if that really fixes things. The results?

Missing final quote

with final quote added

inject to sendmail

ht” in final email to LN; extension chopped

htm” and all is good

inject to LN

htm in final email; but extension not chopped

htm” and all is good

I now had incontrovertible proof that sendmail, my sendmail was altering the original message. It is looking at the unbalanced quote mark situation and recovering as best as possible by replacing the terminating character “m” with the missing double quote “. I was beginning to suspect it. After that shock drained away, I tried to check the RFCs. I figured it must be some well-meaning attempt on its part to make things right. Well, the RFCs, 822 and 1806 are a little hard to read and apply to this situation.

Let’s be clear. There’s no question that the sender is wrong and ought to be closing out that quote. But I don’t think there’s some single, unambiguous statement from the RFCs that make that abundantly apparent. Nevertheless, of course that’s what I told them to do.

The other thing from reading the RFC is that the whole filename attribute looks optional. To satisfy my curiosity – and possibly provide more options for remediation to aspen – I sent a test where I entirely left out the offending filename=”tower… line. In that case the line above it should have its terminating semicolon shorn:

Content-Disposition: attachment

After all, there already is a name=”tower…” as a Content-type parameter, and the string following that was never in question: it has its terminating semicolon.

Yup, that worked just great too!

Then I thought of another approach. Shouldn’t the overriding definition of the what the filetype is be contained in the Content-type header? What if it were more correctly defined as

Content-type: text/html


Content-type appears in two places in this email. I changed them both for good measure, but left the unbalanced quotations problem. Nope. Lotus Notes did not know what to with the attachment it displays as tower status_2012_06_04– So we can’t recommend that course of action.

What Sendmail’s Point-of-View might be
Looking at the book, I see sendmail does care about MIME headers, in particular it cares about the Content-Disposition header. It feels that it is unreliable and hence merely advisory in nature. Also, some years ago there was a sendmail vulnerability wherein malformed multipart MIME messages could cause sendmail to crash (see So maybe sendmail is just a little sensitive to this situation and feels perfectly comfortable and justified in right-forming a malformed header. Just a guess on my part.

Case closed.

We battled a strange email attachment naming error which seemed to be an RFC violation of the MIME protocols. By carefully constructing a testing program we were easily able to reproduce the problem and isolate the fault and recommend corrective actions in the sending program. Now we have a convenient way to inject SMTP email whenever and wherever we want. We feel sendmail’s reputation remains unscathed, though its corrective actions could be characterized as overly solicitous.

Internet Mail

How to run sendmail in queue-only mode

I guess I’ve ragged on sendmail before. Incredibly powerful program. Finding out how to do that simple thing you want to do may not be so easy, even with the bible at your side. So to that end I’m making an effort to document those simple things which I’ve found I’ve struggled with.

The Details
Today I wanted to capture all email coming into my sendmail daemon. Well, actually it’s a little more complicated. I didn’t want to disturb production email, but I wanted to capture a spam sample. Today there was a hugely effective spam campaign purporting to be email from the Better Business Bureau (BBB). All the emails however actually came from various senders Postini put a filter in place but I knew more were getting through. But they weren’t coming to me. How to get capture them without disturbing users?

In this post I gave some obscure but useful tips for sendmail admins, including the ever-useful smarttable add-on. To reprise, smarttable allows you to make delivery decisions based on sender! That’s totally antithetical to your run-of-the-mill sendmail admin, but it’s really useful… Like now. So I quickly put up a sendmail instance, copying a working config I use in production. But I changed the listener to IP address (which I fortunately had already set up for some other reason I can no longer recall). That one’s pretty standard. That’s just:

DAEMON_OPTIONS(`Name=sm-cap, Addr=')dnl

Of course you want to create a new queue directory just for the captured emails. I created /mqueue/c0 and put in this line into my .mc file:

define(QUEUE_DIR, `/mqueue/c*')dnl

And here’s the main point, how to defer delivery of all emails. Sendmail actually distinguishes between defer and queueonly. I chose queueonly thusly:


If by chance you happen to misspell DELIVERY_MODE, like, let’s say, DELIERY_MODE, you don’t seem to get a whole lot of errors. Not that that would ever happen to us, mind you, I’m just saying. That’s why it’s good to also know about the command-line option. Keep reading for that.

It’s simple enough to test once you have it running (which I do with this line: sudo sendmail -bd -q -C/etc/mail/

> telnet 25
Connected to
Escape character is ‘^]’.
220 ESMTP server ready at Fri, 24 Feb 2012 15:16:40 -0500
helo localhost
250 Hello [], pleased to meet you
mail from:
250 2.1.0… Sender ok
rcpt to:
250 2.1.5… Recipient ok
354 Enter mail, end with “.” on a line by itself
subject: test of the capture-only sendmail instance

Just a test!
-Dr J

250 2.0.0 q1OKGet2008636 Message accepted for delivery
221 2.0.0 closing connection
Connection closed by foreign host.

Is the message there, queued up the way we’d like? You bet:

> ls -l /mqueue/c0

total 16
-rw------- 1 root root  19 2012-02-24 15:17 dfq1OKGet2008636
-rw------- 1 root root 542 2012-02-24 15:17 qfq1OKGet2008636

There also seems to be a second way to run sendmail in queue-only fashion. I got it to work from the command-line like this:

> sudo sendmail -odqueueonly -bd -C/etc/mail/

The book says this is deprecrated usage, however. But let’s see, that’s O’Reilly’s Sendmail 3rd edition, published in 2003, we’re in 2012, so, hmm, they still haven’t cut us off…

One last thing, that smarttable entry for my main sendmail daemon. I added the line: relay:[]

It can be useful to queue all incoming emails for various reasons. It’s a little hard to find out how to do this precisely. We found a way to do this without stopping/starting our main sendmail process. This post shows a couple ways to do it, and why you might need to.

May 2012 Update
Just wanted to mention about BBB email how I handle it now. They told me they maintain an accurate SPF record. Sure enough, they do. Now we only accept email when the SPF record is a match. But I don’t use sendmail for that, I use Postini’s (OK, Google’s, technically) mail hygiene service. Postini rocks!

My most recent post on how to tame the confounding sendmail log is here.