Categories
Admin Internet Mail

Obscure Tips for Sendmail Admins

Intro
Sendmail is an amazing program. The O’Reilly Sendmail book is its equal, coming in well over 1000 pages. I constantly marvel at how it was possible to pack so much knowledge into one book written by one person. Having run sendmail for over 10 years, I’ve built up a few inside tips that can be extremely hard to find out by yourself, even with the book’s help. I just learned one today, in fact, so I thought I’d put it plus some others in one place where their chances of being useful is slightly greater.

Tip 1: Multiple IPs in a Mailertable Entry, No MX Record Required
Today I learned that you can specify multiple domains in a mailertable entry even when you’re using IP addresses, as in this example:

drjohnstechblog.com       smtp:[50.17.188.196]:[10.10.10.11]

I tested it by putting 50.17.188.196 behind a firewall where it was unreachable. Sure enough, the smtp mail delivery agent of sendmail tried 10.10.10.11 next. You can continue to extend this with additional IPs

Why is this important? If someone has provided you a private IP to forward mail to, say because of a company-to-company VPN, you cannot rely on the usual DNS lookups to do the routing. And a big outfit may have two MTAs reachable in this way. Now you’ve got redundancy built-in to your delivery methods. Just as you have for organizations with multiple MX records. I paged through the book this morning and did not find it. Maybe it’s there. But it’s in an obscure spot if it is.

Tip 2: Error message Containing Punctuation
I also don’t think it’s obvious how to include multiple punctuation marks in a custom error message, even after reading the book. Here’s an example for your access table:

To:hotimail.com   ERROR:"550 You sent an email to hotimail.com.  You probably meant hotmail.com?"

So it’s the quotes that allow you to include the several punctuation marks. The 550 at the beginning will be seen as the error number.

Tip 3: Smarttable for Sender-Based Routing Decisions
Have you ever wanted to make routing decisions based on sender address rather than recipient address? Well, you can! The key is to use smarttable. In my MC file I have:

dnl Define an enhancement, smarttable, from Andrzej Filip
dnl now at http://jmaimon.com/sendmail/anfi.homeunix.net/sendmail/smarttab.html
FEATURE(`smarttable',`hash -o /etc/mail/smarttable')dnl

It’s sufficiently well documented at that page. You need his smarttable.m4 file. So this is not for beginners, but it’s not that hard, either. Although it looks like smarttable hasn’t been updated since 2002, I want to mention that it still works with the latest versions of sendmail. You can route based on the sender domain, or an individual sender address. I use it to send some messages to an encryption gateway. My smarttable entries tend to look like this:

[email protected]          relay:[192.168.12.34]

What’s First: Routing Based on Sender or Recipient??
What if your recipient’s domain is in the mailertable and your sender’s address is in the smarttable? What takes precedence in that case? The mailertable entry does. I do not know a way to change that. I actually did experience that conflict and found one way around it.

In my case I had some mailertable entries like this one:

drjohnstechblog.com            relay:drjohnstechtalk.com

with my smarttable entry as above. So I get into this conflict when [email protected] wants to send email to [email protected]. What I did is run a private BIND DNS server and remove the mailertable entry. My private DNS server is mostly a cache-only server with the usual Internet root servers. But since the public Internet value for the MX record for drjohnstechblog.com is not what I wanted for mail delivery purposes, I created a zone for drjohnstechblog.com on my private DNS server and created the MX record

drjohnstechblog.com            IN   MX   0 drjohnstechtalk.com.

thus overwriting the public MX value for drjohnstechblog.com. Then, of course, I have my server where I am running sendmail set to use my private DNS server as nameserver in /etc/resolv.conf, i.e.,

nameserver 127.0.0.1

since I ran my private DNS server on the same box. Without the mailertable entry sendmail uses DNS to determine how to deliver email unless of course the sender matches a smarttable entry! If my server relies on resolving other resource records within drjohnstechblog.com for other purposes then I have to redefine them, too.

This trick works for individual domains. What if you feel the need for an “everythnig else” entry in your mailertable, i.e.,

.                     relay:relayhost.drjohnstechtalk.com

Well, you’re stuck! I don’t have a solution for you. My DNS trick above could be extended to work for mail with some wildcard entries, but it will break so many other things that you don’t want to go there.

Tip 4: How to send the same email to two (or more) different servers
Someone claimed to need this unusual feature. See the discussion in the comment section about how I believe this is possible to do and an outline of how I would do it.
The blog posting I reference about running sendmail in queue-only mode is here.

Conclusion
Hopefully these sendmail tips will make your life as a sendmail admin toiling away in obscurity (not that I know anyone like that : ) ), just a little easier.

Resources
The sendmail book is the one by Bryan Costales. At Amazon: http://www.amazon.com/sendmail-4th-Bryan-Costales/dp/0596510292/ref=sr_1_1?s=books&ie=UTF8&qid=1316630255&sr=1-1.

My most recent post on how to tame the confounding sendmail log is here.

Using smarttable with a catch-all mailertable entry, plus virtusertable and more, is described in my latest sendmail post.

Categories
Admin DNS IT Operational Excellence

The IT Detective Agency: How We Neutralized Nasty DNS Clobbering Before it Could Bite Us

This gets a little involved. But if you’re the IT expert called on to fix something, you better be able to roll up your sleeves and figure it out!

In this article, I described how some, but not all ISPs change the results of DNS queries in violation of Internet standards.

A Proxy PAC for All
This work was done for an enterprise. They want everyone to use a proxy PAC file which whose location was to be (obfuscating the domain name just a little here) http://webproxy.intranet.drjohnstechtalk.com/proxy.pac. Centralized large enterprises like this sort of thing because the proxy settings are controlled in the one file, proxy.pac, by the central IT department.

So two IT guys try this PAC file setting on their work PC at their home networks. The guy with Comcast as his ISP reports that he can surf the Internet just fine at home. I, with Centurylink, am not so successful. It takes many minutes before an eventual timeout seems to occur and I cannot surf the Internet as long as I have that PAC file configured. But I can always uncheck it and life is good.

Now along comes a new requirement. This organization is going to roll out VPN without split tunneling, and the initial authentication to that VPN is a web page on the VPN switch. Now we have a real problem on our hands.

With my ISP, I can shut off the PAC file, get to the log-on page, establish VPN, but at that point if I wanted to get back out to the Internet (which is required for some job functions) I’d have to re-establish the PAC file setting. Furthermore it is desirable to lock down the proxy settings so that users can’t change them in any case. That makes it sound impossible for Centurylink customers, right?

Wrong. By the way the Comcast guy had this whole scenario working fine.

The Gory Details
This enterprise organization happened to have chosen legitimately owned but unused internal namespace for the PAC file location, analagous to my webproxy.intranet.drjohnstechtalk.com in my example. I reasoned as follows. Internet Explorer (“IE”) must quickly learn in the Comcast case that the domain name of the PAC file (webproxy.intranet.drjohnstechtalk.com) resolves with a NXDOMAIN and so it must fall back to making DIRECT connections to the Internet. For the unfortunate soul with CenturyLink (me), the domain name is clobbered! It does resolve, and to an active web site. That web site must produce a HTTP 404 not found. At least you’d think so. Today it seems to produce a simplified PAC file, which I am totally astonished by. And I wonder if this is more recent behaviour present in an attempt to ameliorate this situation. In any case, I reasoned that if they were clobbering a non-existent DNS record, we could actually define this domain name, but instead of going through the trouble of setting up a web server with the PAC file, just define the domain name as the loopback interface, 127.0.0.1. There’s no web server to connect to, so I hoped the browser would quickly detect this as a bad PAC URL, go on its way to make DIRECT connections to the VPN authentication web site, and then once VPN were established, use the PAC file again actively to permit the user to surf the Internet. And, furthermore, that this should work for both kinds of users: ones with DNS-clobbering ISPs and ones without.

That’s a lot of assumptions in the previous paragraph! But I built the case for it – it’s all based on reasonable extrapolation from observed behaviour. More testing needs to be done. What we have seen so far is that this DNS entry does no harm to the Comcast user. Direct Internet browsing works, VPN log-in works, Internet browsing post-login works. For the CenturyLink user the presence of this DNS entry permitted the browser of the work PC to surf the Internet very readily, which is already progress. VPN was not tested but I see no reason why it wouldn’t work.

More tests need to be done but it appears to be working out as per my educated guess.

April 2012 Update
Our fix seemed to collapse like a house of cards all-of-a-sudden many months later. Read how instead of panicking, we re-fixed it using our best understanding of the problems and mechanisms involved. The IT Detective Agency: Browsing Stopped Working on Internet-Connected Enterprise Laptops

Conclusion
We found a significant issue with DNS clobbering as practiced by some ISPs in an enterprise-class application: VPN. We found a work-around after taking an educated guess as to what would work – defining webproxy… to resolve to 127.0.01. We could have also changed the domain name of the PAC file – to one that wouldn’t be clobbered – but that was set by another group and so that option was not available to us. Also, we don’t yet know how extensive DNS clobbering is at other ISPs. Perhaps some clobber every domain name which returns a NXDOMAIN flag. That’s what Google’s DNS FAQ seems to imply at any rate. A more sensible approach may have been to migrate to use the auto-detect proxy settings, but that’s a big change for an enterprise and they weren’t ready to do that. A final concern is what if the PC is running a local web server because some application requires it?? That might affect our results.

Case: just about solved!

References
A related case of Verizon clobbering TCP reset packets is described here.

Categories
Admin Internet Mail IT Operational Excellence

The IT Detective Agency: The Case of Slow Sendmail Performance Finally Cracked

I’ve been running sendmail for years and years. It’s a very solid MTA, though perhaps not fashionable these days. At one point I even made the leap from running on Sun/solaris to SLES. I’ve always had a particular problem on a couple of these servers: they do not react gracefully to mail storms. An application running on another server sends out a daily mail blast to 2000 users, all at once. Hey I’m not running Gmail here, but normal volume is several messages per second nonetheless, and that is handled fairly well.

But this mail blast actually knocks the system offline for a few minutes. The load average rockets up to 160. It’s essentially a self-inflicted denial-of-service attack. In my gut I always felt the situation could be improved, but was too busy to look into it.

When it was time to buy a replacement server, I had to consider and justify what to get. A “screaming server” is a little hard for a hardware vendor to turn into an order! So where are the bottlenecks? I decided to capture output of uptime, which provides load averages, and iostat, an optional package which analyzes I/O usage, at five secon intervals throughout the day. Here’s the iostat job:

nohup iostat -t -c  -m -x 3 > /tmp/iostat &

and the uptime was a tiny script I called cpu-loop.sh:

#!/bin/sh
while /bin/true; do
sleep 5
date
uptime
done

called from the command line as:

nohup ~/cpu-loop.sh > /tmp/cpu &

Strange thing is that though load average shoots the roof, cpu usage isn’t all that high.

If I have this right, load average shows the number of processes scheduled by the scheduler. Sendmail forks a process for each incoming email, so the number of sendmail processes climbs dramatically during a mail storm.

The fundamental issue is are we thirsting for more CPU or more I/O? Then there are the peripheral concerns like speed of pci bus, size of level two cache and number of cpus. The standard profiling tools don’t quite give you enough information.

Here’s actual output of three consecutive iostat executions:

Time: 05:11:56 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.92    0.00    5.36   21.74    0.00   66.99

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    10.00    0.00    3.00     0.00     0.05    37.33     0.03    8.53   5.33   1.60
sdb               0.00   788.40    0.00  181.40     0.00     3.91    44.12     4.62   25.35   5.46  98.96
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    2.40     0.00     0.01     8.00     0.02    8.00   1.33   0.32
dm-3              0.00     0.00    0.00    2.40     0.00     0.01     8.00     0.01    5.67   2.33   0.56
dm-4              0.00     0.00    0.00    0.80     0.00     0.00     8.00     0.01   12.00   6.00   0.48
dm-5              0.00     0.00    0.00    7.60     0.00     0.03     8.00     0.08   10.32   1.05   0.80
hda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00  975.00     0.00     3.81     8.00    20.93   21.39   1.01  98.96
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Time: 05:12:01 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.05    0.00    4.34   19.98    0.00   70.64

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    10.80    0.00    2.80     0.00     0.05    40.00     0.03   10.57   6.86   1.92
sdb               0.00   730.60    0.00  164.80     0.00     3.64    45.20     3.37   20.56   5.47  90.16
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    2.60     0.00     0.01     8.00     0.03   12.31   2.15   0.56
dm-3              0.00     0.00    0.00    2.40     0.00     0.01     8.00     0.02    6.33   3.33   0.80
dm-4              0.00     0.00    0.00    0.80     0.00     0.00     8.00     0.01    9.00   5.00   0.40
dm-5              0.00     0.00    0.00    7.60     0.00     0.03     8.00     0.10   13.37   1.16   0.88
hda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00  899.60     0.00     3.51     8.00    16.18   18.03   1.00  90.24
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Time: 05:12:06 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.91    0.00    1.36   10.83    0.00   85.89

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     6.40    0.00    3.40     0.00     0.04    25.88     0.04   12.94   5.18   1.76
sdb               0.00   303.40    0.00   88.20     0.00     1.59    36.95     1.83   20.30   5.48  48.32
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    2.60     0.00     0.01     8.00     0.04   14.77   2.46   0.64
dm-3              0.00     0.00    0.00    0.60     0.00     0.00     8.00     0.00   12.00   5.33   0.32
dm-4              0.00     0.00    0.00    0.80     0.00     0.00     8.00     0.01   11.00   5.00   0.40
dm-5              0.00     0.00    0.00    5.80     0.00     0.02     8.00     0.08   12.97   1.66   0.96
hda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00  393.00     0.00     1.54     8.00     6.46   16.03   1.23  48.32
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Device sdb has reached crazy high utilization levels – 98% before dropping back down to 48%. An average queue size of 4.62 in the first run means a lot of queued up processes awaiting I/O. Write requests (merged) per second of 788 seems respectable. All this, while the CPU is 67% idle!

The conclusion: a solid state drive is in order. We are dying thirsting for I/O more than for CPU. But solid state drives cost money and have to be justified which takes time. Can we do something which proves it will bear out our hypothesis and really alleviate the problem? Yes! SSD is like accessing memory. So let’s build a virtual partition from our memory. tmpfs has made this sinfully easy:

mount -t tmpfs none /mqueue -o size=8192m

We set this to be sendmail’s queue directory. The sendmail mc command looks like this:

define(`QUEUE_DIR',`/mqueue/q*')dnl

which I need to further explain at some point.

Now it’s interesting that this tmpfs filesystem doesn’t even show up in iostat! I guess its usage all counts as cpu usage.

I now have to send my mail blast to the system with this tmpfs setup. I’m expecting to have essentially converted my lack of I/O into better usage of spare CPU, resulting in a higher-performance system.

The Results
The results are in and they are dramatic. Previous results using traditional 15K rotating drive:

- disk device became 98% busy
- cpu idle time only dropped as low as 69%
- load average peaked at 37
- SMTP port shut down for some minutes
- 2030 messages accepted in 187 seconds
- 11 messages/second

and now using tmpfs virtual filesystem:

- the load average rose to 3.1 - a much more tolerable result
- the cpu idle time dropped to 32% during the busiest time
- most imporantly, the server stayed open for business - the SMTP port did not shut down for the first time!!
- the 2000 messages were accepted in 34 seconds.  
- that's a record 59 messages/second!

Conclusion
Disk I/O was definitely the bottleneck for sendmail. tmpfs rocks! sendmail becomes five times faster using it, and is better behaved. The drawback of this filesystem type is that it is completely volatile and I stand to lose messages if the power ever goes out!

Case Closed!

Categories
Admin IT Operational Excellence Web Site Technologies

Virtual Server not Working in F5 BigIP

OK. This posting is only directly applicable to the small number of people who run BigIP load balancers. And of that set, only a certaIn subset will likely ever have this situation. Nevertheless, it’s useful to document it. There are lessons in it for the rest of us, it shows the creative problem-solving process used in IT, or rather the creative process that should be used.

So I had a virtual server associated with a certain pool and it was operating fine for years. Then something changes. We want to associate a new name with this virtual server, but test it first, all while keeping the old name working. Well, this is a secured site by which I mean it is running https rather than http. There’s nothing intrinsic in the web site itself that ties it to a particular name. If this were a run-of-the-mill non-secure site you would solve this problem with DNS. Set up an alias and you’re good to go. But secured sites are a wee bit trickier. They present a certificate after all. And the certificate has just one name, at least ours does. Guess I can address multi-name certificates known as Subject Alternative Name CERTs in a separate post. And that name is the original DNS name. What to do? Simple. As any BigIP admin would tell you you create a new virtual server and associate it with a new IP and a new SSL profile containing the new certificate you just bought but the old pool. In DNS assign this new IP to your new DNS name. That’s all pretty straightforward

Having done all that, I blithely tested with lynx (iI’s an old curses-based browser which runs on old Unix systems. The main point is to not test with a complex browser where like Internet Explorer where you are never 100% sure if the problem lies with the browser. If I had it, I would test with curl, but it’s not on that system.). And…it hangs.

Now I’ll admit a lot of stupid things I did (which is typical of any good debugging session of an IT professional – some self-created red herrings accompany any decent sleuthing) and I ratchet up the debugging a notch. Check the web server logs. I see no log of my lynx accesses. Dig a little deeper still. Fire up a trace. Here’s a little time-saver. BigIP does have a tcpdump program, but it is a little stunted. Typically you have multiple interfaces on a BigIP. In this case I felt it pertinent to know if packets were getting to the BigIP from lynx, and then again, if those packets were leaving the BigIP and going to the web server. So the tip is that whereas a “normal” tcpdump might allow you to use the switch -i any to listen on all interfaces, that doesn’t work on BigIP. Use -i 0.0 instead. And of course restrict it somehow so that your own shell session’s packets won’t be picked up by the trace, or else you could be in for a nasty surprise of exponentially increasing traffic (a devastating situation perhaps worthy of its own blog entry!). In this case I added an expression port 443. So I have:

tcpdump -i 0.0 port 443

And, somewhat to my surprise (You should always have a hypothesis, even if it’s just a gut feeling: will this little test work, or not. Why?) not only were packets going from lynx to BigIp and then again to the web server, I could even see returned packets back from the web server to BigIp to lynx. But it was not a lot of packets. A SYN, SYN-ACK and maybe a single data packet and that’s about it. It should have been more chatty.

The more tests you can think of, the better, especially ones that emphasize the marginal differences between the thing that works and the one that doesn’t. One test along those lines: take this same virtual server and associate it with a different pool. I did that, and that test worked!

Next, I tried to access the web server using curl on the BigIP itself. I could, but not at first. First I used the local web server URL http://web_server_ip:443/. It hung my curl command, just like using lynx on the other server. Hmm. I then looked on the web server again. I notice that it has a certificate installed. Ah. So it’s actually running https. So try curl from BigIP again, but this time with the -k switch (insecure, meaning don’t verify the certificate issuer) and a url beginning with https rather than http. Bingo. It comes back with the home page. Now we’re getting somewhere.

Finally I look more closely at the virtual server setup for the old name, the one that works. I see that the server profile is SSL. It basically means that the traffic is encrypted when it hits the BigIP, and the server CERT is associated with the external name. The BigIP decrypts the traffic, then re-encrypts it before sending it along to the web server. The CERT for the second leg is a self-signed CERT and is never seen by users.

I had forgotten to set up my new test virtual server with the server SSL profile, so the second leg of traffic was not being re-encyrpted by the BigIP, even though the web server was only willing to engage in SSL communication with the BigIP. Once I updated the server profile, it all worked fine! Of course after getting the expected results from lynx I went to my desktop browser, just like a regular user, and successfully tested it there as well. You want to make sure your final tests are a realistic approximation of what the user will be doing. If that’s not all possible under your own control, bring in a user for testing.

Liked this article? Here’s another of my IT operational excellence articles that has a somewhat wider applicability.