Postini – Dr John's Tech Talk

Intro
I guess I’ve ragged on sendmail before. Incredibly powerful program. Finding out how to do that simple thing you want to do may not be so easy, even with the bible at your side. So to that end I’m making an effort to document those simple things which I’ve found I’ve struggled with.

The Details
Today I wanted to capture all email coming into my sendmail daemon. Well, actually it’s a little more complicated. I didn’t want to disturb production email, but I wanted to capture a spam sample. Today there was a hugely effective spam campaign purporting to be email from the Better Business Bureau (BBB). All the emails however actually came from various senders @aicpa.org. Postini put a filter in place but I knew more were getting through. But they weren’t coming to me. How to get capture them without disturbing users?

In this post I gave some obscure but useful tips for sendmail admins, including the ever-useful smarttable add-on. To reprise, smarttable allows you to make delivery decisions based on sender! That’s totally antithetical to your run-of-the-mill sendmail admin, but it’s really useful… Like now. So I quickly put up a sendmail instance, copying a working config I use in production. But I changed the listener to IP address 127.0.0.2 (which I fortunately had already set up for some other reason I can no longer recall). That one’s pretty standard. That’s just:

DAEMON_OPTIONS(`Name=sm-cap, Addr=127.0.0.2')dnl

Of course you want to create a new queue directory just for the captured emails. I created /mqueue/c0 and put in this line into my .mc file:

define(QUEUE_DIR, `/mqueue/c*')dnl

And here’s the main point, how to defer delivery of all emails. Sendmail actually distinguishes between defer and queueonly. I chose queueonly thusly:

define(`confDELIVERY_MODE',`queueonly')dnl

If by chance you happen to misspell DELIVERY_MODE, like, let’s say, DELIERY_MODE, you don’t seem to get a whole lot of errors. Not that that would ever happen to us, mind you, I’m just saying. That’s why it’s good to also know about the command-line option. Keep reading for that.

It’s simple enough to test once you have it running (which I do with this line: sudo sendmail -bd -q -C/etc/mail/capture.cf).

> telnet 127.0.0.2 25
Trying 127.0.0.2…
Connected to 127.0.0.2.
Escape character is ‘^]’.
220 drj.com ESMTP server ready at Fri, 24 Feb 2012 15:16:40 -0500
helo localhost
250 drjemgw2.drj.com Hello [127.0.0.2], pleased to meet you
mail from: asd@gmail.com
250 2.1.0 asd@gmail.com… Sender ok
rcpt to: drj@drj.com
250 2.1.5 drj@drj.com… Recipient ok
data
354 Enter mail, end with “.” on a line by itself
subject: test of the capture-only sendmail instance

Just a test!
-Dr J
.
250 2.0.0 q1OKGet2008636 Message accepted for delivery
quit
221 2.0.0 drj.com closing connection
Connection closed by foreign host.

Is the message there, queued up the way we’d like? You bet:

> ls -l /mqueue/c0

total 16
-rw------- 1 root root  19 2012-02-24 15:17 dfq1OKGet2008636
-rw------- 1 root root 542 2012-02-24 15:17 qfq1OKGet2008636

There also seems to be a second way to run sendmail in queue-only fashion. I got it to work from the command-line like this:

> sudo sendmail -odqueueonly -bd -C/etc/mail/capture.cf

The book says this is deprecrated usage, however. But let’s see, that’s O’Reilly’s Sendmail 3rd edition, published in 2003, we’re in 2012, so, hmm, they still haven’t cut us off…

One last thing, that smarttable entry for my main sendmail daemon. I added the line:

@aicpa.org relay:[127.0.0.2]

Conclusion
It can be useful to queue all incoming emails for various reasons. It’s a little hard to find out how to do this precisely. We found a way to do this without stopping/starting our main sendmail process. This post shows a couple ways to do it, and why you might need to.

May 2012 Update
Just wanted to mention about BBB email how I handle it now. They told me they maintain an accurate SPF record. Sure enough, they do. Now we only accept bbb.org email when the SPF record is a match. But I don’t use sendmail for that, I use Postini’s (OK, Google’s, technically) mail hygiene service. Postini rocks!

My most recent post on how to tame the confounding sendmail log is here.

(Updated 12/19/2011 and 6/2014 with additional character sets)
(updated 9/2012 with additional signature)
Intro
I have been a target for random Chinese language spam in my various email accounts, but the problem has really gotten worse in the past few months.

The thing about these messages is that at first Postini (a Google spam filtering service used mostly by businesses), wasn’t very good at catching them. Postini is about the best in the business, and they’re competently catching just about every other type of spam. But these Chinese character messages kept slipping through…

Their support tech gave me some advice which turned out to be incorrect, but led me in the right direction. Their tech told told me to create a content manager rule, but the actual rule he provided was only going to catch Russian and Ukranian spam!

This is the rule he provided:

Rule Name: Non_English_spam
"Match Any"
Header - matches regex

koi8-r|koi8-u|koi7|koi8
Disposition: delete (blackhole)
Set quarantine to Recipient

I had no idea what that was doing, so I looked up koi8-r, koi8, etc and found that it had to do with the Cyrillic alphabet. So I wondered if the Chinese language spams have something similar, but for Chinese. Indeed they do: gb2312. Looking at a few of my Chinese spams, almost all contain this string in the headers. It’s not always in the exact same place, but it’s there. To be concrete, here’s an example (some headers have been obfuscated to prevent the bad guys from trying to reverse engineer Postini’s scoring algorithms):

Received: from websmtp.sohu.com ([61.135.132.136]) by eu1sys200amx108.postini.com ([207.126.147.10]) with SMTP;
		 Sun, 28 Aug 2011 18:41:21 GMT
Received: from omlbw (unknown [110.53.27.141])
		 by websmtp.sohu.com (Postfix) with ESMTPA id 9B3C6720CEA;
		 Sun, 28 Aug 2011 23:55:04 +0800 (CST)
Message-ID: <20110828235546325581@sogou.com>
From: =?gb2312?B?y7O1wsf4xu/A1rbguabE3NfU0NCztdPQz965q8u+?= <66998448@sogou.com>
To: 
Subject: =?gb2312?B?d3Azz/ogytsg1vcgudwg1/Yg0KkgIMqyIMO0IA==?=
		 =?gb2312?B?uaQg1/cgssUgxNwgzOEgIMn9INK1ILyoIKO/LS0=?=
		 =?gb2312?B?qIk=?=
Date: Sun, 28 Aug 2011 23:55:37 +0800
MIME-Version: 1.0
X-mailer: Lzke 2
X-SOHU-Antispam-Bayes: 0
X-pstn-levels:     omitted
X-pstn-settings: omitted
X-pstn-addresses: from <66998448@sogou.com> [49/2] 

Content-Type: multipart/mixed;
		 boundary="----=_NextPart_000_015A_013AC9FA.1A2D5A60"

------=_NextPart_000_015A_013AC9FA.1A2D5A60
Content-Transfer-Encoding: base64
Content-Type: text/html;
		 charset="gb2312"

See it? charset=”gb2312″ appears in the content-type header and =?gb2312? appears in both the Subject and From fields.

That message looks like this as displayed in my mail client:

How do I know this is Chinese? I pasted the characters into translate.google.com and it auto-detected it. That’s a convenient tool!

How do I know it is spam? I am open-minded. Perhaps it is a legitimate business proposition that just happens to be written in Chinese? It does sort of read that way from the translation of any one such message. On the other side are some stronger pieces of evidence. The empty To: header is a strong hint, but some legitimate messages could contain that undesirable feature, so that is merely an indicator but not definitive. Most important is the fact that I get these messages, all showing similar patterns in appearance, and most telling always coming from a different sender tells me unambiguously that this is really, truly spam.

So the actual Postini Content Manager rule to capture Chinese spam is this:

Rule Name: Chinese_spam
"Match Any"
Header matches regex (charset="gb2312"|=\?GB2312\?)

Disposition: delete (blackhole)
Set quarantine to Recipient

Obviously this type of rule is a bit dangerous. What if you are expecting something written in Chinese? It will be subject to the same treatment as the spam. That is why the suggestion is to Set quarantine to recipient so that these messages could be delivered from the user quarantine.

And over the course of a couple months Postini has gotten much better about capturing this type of spam. That is the best thing – to let the experts handle it. They just needed to train their algorithms. I was quite concerned at first that this spam is so different from the usual, recognizable spam campaigns that they might have a hard time spotting it while simultaneously allowing the good Chinese email through. But they’re almost there…

12/19 UpdateThe filter described above has been working extremely well for me. Essentially perfectly, in fact, as I can see when I look in my quarantine. But not today. Today I got some suspected Chinese spam in and examing the headers showed something slightly different. The subject looks like this:

Subject: =?GBK?B?bnZ2dyAyMDExLjEyLTIwMTItMDEgvqsgxrcgzcYgz/ogIGZkZXI=?=

And the Mime header also had that string:

Content-Type: text/plain;
		 charset=GBK

Looking up GBK character set you’ll immediately see it is simplified Chinese, extended. So I think we better add that character set to our expression. It makes our content manager rule only a little more complicated. Now we would have:

Rule Name: Chinese_spam
"Match Any"
Header matches regex (charset="gb(k|2312)"|=\?GB(K|2312)\?)

Disposition: delete (blackhole)
Set quarantine to Recipient

For the complete prescription see the summary in the Conclusion.

If you happened upon this article and don’t have the Postini service is there any relevance? Yes, I think so. You should be able to filter on the message headers to look for the string =?gb2312? or =?gbk? in the beginning of the subject line. To speak about mailers with which I have some experience, in sendmail you could do this with a milter. In PureMessage it would be possible to concoct an appropriate rule as well.

9/2012 Update
My filter was working so well these past few months I essentially forgot about the problem, but the occasional Chinese spam slipped through. How? It used a different encoding. Here is an example subject line:

Subject: =?utf-8?B?6K+35p+l5pS277yB?=

This is displayed by my mail client as three Chinese characters followed by “!” They used a different encoding. This one drove me to do a little research. This is an Encoded-Word, according to Wikipedia’s excellent MIME writeup. The “?B?” in the front means base64 encoding. I had previously written a mimedecoder in perl, which I put to use:

> mimedecode 6K+35p+l5pS277yB

which produces:

???!

which is pretty much garbage. So I decided to analyze the output with unix utility od:

> mimedecode 6K+35p+l5pS277yB|od -x

which gives

0000000 e8af b7e6 9fa5 e694 b6ef bc81

Next, I needed a UTF-8 converter, which I found at this Swiss site.

I used it with input type hexadecimal.

The results reproduced exactly the Chinese characters my mail client displayed to me! It also gives a lot of other descriptions for these characters (such as Cangjie). The first few lines begin:

As character names:

U+8BF7 CJK UNIFIED IDEOGRAPH character (请)
U+67E5 CJK UNIFIED IDEOGRAPH character (查)
U+6536 CJK UNIFIED IDEOGRAPH character (收)
U+FF01 FULLWIDTH EXCLAMATION MARK character (！)

As raw characters:

请查收！

…

Well, that was an interesting exercise, but I’m not sure we’ve learned anything that can be put to use in a RegEx on the original expression. Unless there’s a way to uniquely identify Chinese characters by the beginning of the encoded-word sequence following the ?B?. I have my doubts, but since I don’t seem to get thee UTF-8 emails from other sources, and I have a sample size of about five emails that fooled the other filter to work with, I have developed a content filter which would capture all of them!

Check for a header containing the RegEx:

=\?utf-8\?B\?[56]

More specifically sometimes the utf-8 string is used in the From header, sometimes it is in the subject. Most of my samples would have been caught by the simpler RegEx =\?utf-8\?B\?5, and I mention that in case you want to be more specific, but there was one recent one that had a “6” instead of a “5.”

For the record here’s that mimedecode “program”

#!/usr/bin/perl
# base64 MIME decoding
# example:
# mimedecode Nz84QGxhdGU=
# =&gt; 7?8@late
use MIME::Base64;
 
foreach (@ARGV) {
#      $encoded = encode_base64($_);
      $decoded = decode_base64($_);
#print "enc,dec: $encoded, $decoded\n";
        print $decoded;
}

And its sister program, which I call mimeencode:

#!/usr/bin/perl
# base64 MIME decoding
# DrJ, 6/2004
# example:
# mimedecode Nz84QGxhdGU=
# =&gt; 7?8@late
use MIME::Base64;
 
foreach (@ARGV) {
      $encoded = encode_base64($_);
#      $decoded = decode_base64($_);
#print "enc,dec: $encoded, $decoded\n";
        print $encoded;
}

There’s probably a built-in linux utility which does the same thing, I just don’t know what that is.

2022(!) update

Well, I finally ran across it. The built-in program to do mimeencode/mimedecode is base64. Oh well, better late than never…

Conclusion
Your users needn’t suffer from Chinese Spam. The vast majority are characterized by, um, Chinese characters, of course, whose presence is almost always indicated by the string gb2312 in the message headers. You can take advantage of that fact and build an appropriate rule for Postini or your mailer. But beware of throwing out the baby with the bathwater! In other words, make sure you only subject your users to this rule unless you either have a good quarantine, or they are sure they should never receive this type of email.

There are some spam types which evade the gb2312 rule mentioned above, however. And this part is not as well tested, frankly. The exceptions, which are still a minority of my Chinese spam, are characterized by a subject line or sender that contains =?utf-8?B?5… or =?utf-8?B?6… (see summary below). My honest expectation is that a rule this broad and coarse will also catch a few other languages (Portuguese?, Urdu?, etc.) so be careful! If you are expecting to get non-english email more testing is in order before implementing the utf-8 filter. But it will certainly help to eliminate even more Chinese spam.

4/2013 update
Summary, including 6/2014 update
My filter has worked very well for me and has withstood the test of time. I catch at least a dozen Chinese spams each day. One got through in 6/2014 however, with character set gb18030. I realize reading the above write-up is confusing because I’ve mixed my love of telling a good IT mystery with my desire to convey useful information. So, to summarize, the new combined rule is:

Match Any:

Header matches RegEx:
(charset=”gb(k|2312|18030)”|=\?GB(K|2312|18030)\?)

Header matches RegEx:
=\?utf-8\?B\?[56]

References
A spate of spam from enom-registered domains is described here.
A disappointing case where Google is not operating their Gmail service as a white-glove service is described here.