How to Stop Chinese Spam – for Mail Admins, w/ Sep 2012 update

(Updated 12/19/2011 with additional character set)
(updated 9/2012 with additional signature)
Intro
I have been a target for random Chinese language spam in my various email accounts, but the problem has really gotten worse in the past few months.

The thing about these messages is that at first Postini (a Google spam filtering service used mostly by businesses) wasn’t very good at catching them. Postini is about the best in the business, and they competently catch just about every other type of spam. But these Chinese character messages kept slipping through…

Their support tech gave me some advice which turned out to be incorrect, but led me in the right direction. He told me to create a content manager rule, but the actual rule he provided was only going to catch Russian and Ukrainian spam!

This is the rule he provided:

Rule Name: Non_English_spam
"Match Any"
Header - matches regex

koi8-r|koi8-u|koi7|koi8
Disposition: delete (blackhole)
Set quarantine to Recipient

I had no idea what that was doing, so I looked up koi8-r, koi8, etc and found that it had to do with the Cyrillic alphabet. So I wondered if the Chinese language spams have something similar, but for Chinese. Indeed they do: gb2312. Looking at a few of my Chinese spams, almost all contain this string in the headers. It’s not always in the exact same place, but it’s there. To be concrete, here’s an example (some headers have been obfuscated to prevent the bad guys from trying to reverse engineer Postini’s scoring algorithms):

Received: from websmtp.sohu.com ([61.135.132.136]) by eu1sys200amx108.postini.com ([207.126.147.10]) with SMTP;
		 Sun, 28 Aug 2011 18:41:21 GMT
Received: from omlbw (unknown [110.53.27.141])
		 by websmtp.sohu.com (Postfix) with ESMTPA id 9B3C6720CEA;
		 Sun, 28 Aug 2011 23:55:04 +0800 (CST)
Message-ID: <20110828235546325581@sogou.com>
From: =?gb2312?B?y7O1wsf4xu/A1rbguabE3NfU0NCztdPQz965q8u+?= <66998448@sogou.com>
To: 
Subject: =?gb2312?B?d3Azz/ogytsg1vcgudwg1/Yg0KkgIMqyIMO0IA==?=
		 =?gb2312?B?uaQg1/cgssUgxNwgzOEgIMn9INK1ILyoIKO/LS0=?=
		 =?gb2312?B?qIk=?=
Date: Sun, 28 Aug 2011 23:55:37 +0800
MIME-Version: 1.0
X-mailer: Lzke 2
X-SOHU-Antispam-Bayes: 0
X-pstn-levels:     omitted
X-pstn-settings: omitted
X-pstn-addresses: from <66998448@sogou.com> [49/2] 

Content-Type: multipart/mixed;
		 boundary="----=_NextPart_000_015A_013AC9FA.1A2D5A60"

------=_NextPart_000_015A_013AC9FA.1A2D5A60
Content-Transfer-Encoding: base64
Content-Type: text/html;
		 charset="gb2312"

See it? charset="gb2312" appears in the Content-Type header and =?gb2312? appears in both the Subject and From fields.

That message looks like this as displayed in my mail client:

How do I know this is Chinese? I pasted the characters into translate.google.com and it auto-detected it. That’s a convenient tool!

How do I know it is spam? I am open-minded. Perhaps it is a legitimate business proposition that just happens to be written in Chinese? It does sort of read that way from the translation of any one such message. On the other side are some stronger pieces of evidence. The empty To: header is a strong hint, but some legitimate messages also have that undesirable feature, so it is an indicator rather than proof. Most important, I get many of these messages, all showing similar patterns in appearance and, most telling, always coming from a different sender. That tells me unambiguously that this is really, truly spam.

So the actual Postini Content Manager rule to capture Chinese spam is this:

Rule Name: Chinese_spam
"Match Any"
Header matches regex (charset="gb2312"|=\?GB2312\?)

Disposition: delete (blackhole)
Set quarantine to Recipient
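
To see what the rule matches, here is a minimal Python sketch that applies the same regex to header lines copied from the sample message above. It assumes the match is case-insensitive (the sample headers spell gb2312 in lowercase while the rule spells it GB2312), and Python's re module only stands in for Postini's regex engine, which may differ in dialect. The function name is_chinese_spam is made up for illustration.

```python
import re

# Same pattern as the Chinese_spam rule. IGNORECASE is an assumption:
# the sample headers use lowercase "gb2312", the rule uses "GB2312".
CHINESE_SPAM = re.compile(r'(charset="gb2312"|=\?GB2312\?)', re.IGNORECASE)

# Header lines copied from the sample message above
# (the folded Content-Type header is joined onto one line).
sample_headers = [
    'From: =?gb2312?B?y7O1wsf4xu/A1rbguabE3NfU0NCztdPQz965q8u+?= <66998448@sogou.com>',
    'Subject: =?gb2312?B?d3Azz/ogytsg1vcgudwg1/Yg0KkgIMqyIMO0IA==?=',
    'Content-Type: text/html; charset="gb2312"',
    'Date: Sun, 28 Aug 2011 23:55:37 +0800',
]

def is_chinese_spam(header_line):
    """Return True if the header line matches the rule's regex."""
    return CHINESE_SPAM.search(header_line) is not None

for line in sample_headers:
    # Print the verdict next to the start of each header line.
    print(is_chinese_spam(line), line[:40])
```

The From, Subject and Content-Type lines match; the Date line does not.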

Obviously this type of rule is a bit dangerous. What if you are expecting something written in Chinese? It will be subject to the same treatment as the spam. That is why the suggestion is to Set quarantine to recipient so that these messages could be delivered from the user quarantine.

And over the course of a couple months Postini has gotten much better about capturing this type of spam. That is the best thing – to let the experts handle it. They just needed to train their algorithms. I was quite concerned at first that this spam is so different from the usual, recognizable spam campaigns that they might have a hard time spotting it while simultaneously allowing the good Chinese email through. But they’re almost there…

12/19 Update
The filter described above has been working extremely well for me. Essentially perfectly, in fact, as I can see when I look in my quarantine. But not today. Today I got some suspected Chinese spam, and examining the headers showed something slightly different. The subject looks like this:

Subject: =?GBK?B?bnZ2dyAyMDExLjEyLTIwMTItMDEgvqsgxrcgzcYgz/ogIGZkZXI=?=

And the MIME header also had that string:

Content-Type: text/plain;
		 charset=GBK

Looking up the GBK character set, you’ll immediately see it is an extension of GB2312 for simplified Chinese. So I think we had better add that character set to our expression. It makes our content manager rule only a little more complicated. Now we would have:

Rule Name: Chinese_spam
"Match Any"
Header matches regex (charset="gb(k|2312)"|=\?GB(K|2312)\?)

Disposition: delete (blackhole)
Set quarantine to Recipient
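
One detail is worth checking here. The MIME header in the GBK sample reads charset=GBK with no quotes, so the charset="..." half of the rule does not match it. The message is still caught because its Subject contains =?GBK?. The Python sketch below shows this and also tries a hypothetical variant with optional quotes. That variant is not part of the rule above and is untested in Postini; case-insensitive matching is again an assumption.

```python
import re

# The rule as given above; IGNORECASE is an assumption.
RULE = re.compile(r'(charset="gb(k|2312)"|=\?GB(K|2312)\?)', re.IGNORECASE)

# Hypothetical variant with optional quotes (not in the original rule).
RULE_OPTIONAL_QUOTES = re.compile(
    r'(charset="?gb(k|2312)"?|=\?GB(K|2312)\?)', re.IGNORECASE
)

# Headers from the GBK sample (Content-Type joined onto one line).
subject = 'Subject: =?GBK?B?bnZ2dyAyMDExLjEyLTIwMTItMDEgvqsgxrcgzcYgz/ogIGZkZXI=?='
content_type = 'Content-Type: text/plain; charset=GBK'

print(bool(RULE.search(subject)))                       # True
print(bool(RULE.search(content_type)))                  # False
print(bool(RULE_OPTIONAL_QUOTES.search(content_type)))  # True
```

With "Match Any" a single matching header is enough, so the original rule still catches this sample. The optional quotes would only matter for a message whose From and Subject are plain ASCII.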

For the complete prescription see the summary in the Conclusion.

If you happened upon this article and don’t have the Postini service, is there any relevance? Yes, I think so. You should be able to filter on the message headers to look for the string =?gb2312? or =?gbk? at the beginning of the subject line. To speak about mailers with which I have some experience, in sendmail you could do this with a milter. In PureMessage it would be possible to concoct an appropriate rule as well.
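
For a mailer-independent illustration, here is a rough Python sketch that parses a raw message with the standard email package and checks every header, including the headers of MIME sub-parts, against the rule. It is not a milter or a PureMessage rule, only the same header test in script form. The function name looks_like_gb_mail and both test messages are made up, and case-insensitive matching is an assumption.

```python
import re
from email import message_from_string

# Rule from above; IGNORECASE is an assumption.
GB_RULE = re.compile(r'(charset="gb(k|2312)"|=\?GB(K|2312)\?)', re.IGNORECASE)

def looks_like_gb_mail(raw_message):
    """Return True if any header, including MIME part headers, matches GB_RULE."""
    msg = message_from_string(raw_message)
    for part in msg.walk():  # walk() yields the message itself, then each sub-part
        for name, value in part.items():
            if GB_RULE.search(f"{name}: {value}"):
                return True
    return False

# Made-up test messages, not real mail.
spam = (
    "From: someone@example.com\n"
    "Subject: =?gbk?B?vqsgxrc=?=\n"
    "Content-Type: text/plain; charset=GBK\n"
    "\n"
    "body\n"
)
ham = (
    "From: someone@example.com\n"
    "Subject: Lunch on Friday?\n"
    "Content-Type: text/plain; charset=\"us-ascii\"\n"
    "\n"
    "body\n"
)

print(looks_like_gb_mail(spam))  # True
print(looks_like_gb_mail(ham))   # False
```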

9/2012 Update
My filter was working so well these past few months I essentially forgot about the problem, but the occasional Chinese spam slipped through. How? It used a different encoding. Here is an example subject line:

Subject: =?utf-8?B?6K+35p+l5pS277yB?=

This is displayed by my mail client as three Chinese characters followed by “!”. This one drove me to do a little research. It is an Encoded-Word, according to Wikipedia’s excellent MIME writeup. The “?B?” near the front means base64 encoding. I had previously written a MIME decoder in Perl, which I put to use:

> mimedecode 6K+35p+l5pS277yB

which produces:

???!

which is pretty much garbage. So I decided to analyze the output with the Unix utility od:

> mimedecode 6K+35p+l5pS277yB|od -x

which gives

0000000 e8af b7e6 9fa5 e694 b6ef bc81

Next, I needed a UTF-8 converter, which I found at this Swiss site.

I used it with input type hexadecimal.

The results reproduced exactly the Chinese characters my mail client displayed to me! It also gives a lot of other descriptions for these characters (such as Cangjie). The first few lines begin:

As character names:

U+8BF7 CJK UNIFIED IDEOGRAPH character (请)
U+67E5 CJK UNIFIED IDEOGRAPH character (查)
U+6536 CJK UNIFIED IDEOGRAPH character (收)
U+FF01 FULLWIDTH EXCLAMATION MARK character (！)

As raw characters:

请查收！
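
The same steps (base64 decode, hex dump, UTF-8 lookup) can be sketched in a few lines of Python using only the standard library. This is an alternative to the mimedecode, od and web converter route described above, not the method I actually used.

```python
import base64
import unicodedata

# Payload of the encoded-word in the sample Subject above.
encoded = "6K+35p+l5pS277yB"

# Step 1: base64 decode to raw bytes and show them as hex.
raw = base64.b64decode(encoded)
print(raw.hex(" "))  # e8 af b7 e6 9f a5 e6 94 b6 ef bc 81

# Step 2: interpret the bytes as UTF-8 and list each code point with its name.
text = raw.decode("utf-8")
for ch in text:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The hex bytes agree with the od output, and the four code points are the ones listed above.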

Well, that was an interesting exercise, but I’m not sure we’ve learned anything that can be put to use in a RegEx on the original header, unless there’s a way to identify Chinese characters by the beginning of the encoded-word sequence following the ?B?. I have my doubts, but since I don’t seem to get these UTF-8 emails from other sources, and I have a sample size of about five emails that fooled the other filter to work with, I have developed a content filter which would capture all of them!

Check for a header containing the RegEx:

=\?utf-8\?B\?[56]

More specifically, sometimes the utf-8 string is used in the From header and sometimes in the Subject. Most of my samples would have been caught by the simpler RegEx =\?utf-8\?B\?5, and I mention that in case you want to be more specific, but there was one recent one that had a “6” instead of a “5”.
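
There is an arithmetic reason why 5 and 6 turn up. The first base64 character encodes the top six bits of the first byte, and in UTF-8 the lead bytes E4 to E7 give "5" while E8 to EB give "6". Those lead bytes cover code points U+4000 through U+BFFF, which contains the main block of Chinese characters (U+4E00 to U+9FFF). The Python sketch below checks this; the helper name first_b64_char is made up for illustration.

```python
import base64

def first_b64_char(text):
    """First base64 character of the UTF-8 encoding of text."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")[0]

# Group every three-byte UTF-8 lead byte (0xE0-0xEF) by the base64 character
# it produces. Only the top six bits of the lead byte matter, so the two
# filler bytes are arbitrary (the triples need not be valid UTF-8).
groups = {}
for lead in range(0xE0, 0xF0):
    ch = base64.b64encode(bytes([lead, 0x80, 0x80])).decode("ascii")[0]
    groups.setdefault(ch, []).append(f"{lead:02X}")
print(groups)

# Code point range covered by lead bytes E4 through EB.
low = ord(bytes([0xE4, 0x80, 0x80]).decode("utf-8"))
high = ord(bytes([0xEB, 0xBF, 0xBF]).decode("utf-8"))
print(f"U+{low:04X} to U+{high:04X}")  # U+4000 to U+BFFF

print(first_b64_char("请查收"))      # 6
print(first_b64_char("Re: 请查收"))  # an ASCII first character gives neither 5 nor 6
```

So the rule really tests the first character only: a subject that opens with ASCII text, such as "Re: ", slips past it even if the rest is Chinese.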

For the record here’s that mimedecode “program”

#!/usr/bin/perl
# mimedecode: base64-decode each command-line argument
# example:
# mimedecode Nz84QGxhdGU=
# => 7?8@late
use strict;
use warnings;
use MIME::Base64;

foreach my $encoded (@ARGV) {
    # decode_base64 returns the raw bytes; no newline is appended
    print decode_base64($encoded);
}

There’s a standard Linux utility which does the same thing: base64 -d, from GNU coreutils, decodes whatever it reads on standard input.

Conclusion
Your users needn’t suffer from Chinese spam. The vast majority are characterized by, um, Chinese characters, of course, whose presence is almost always indicated by the string gb2312 in the message headers. You can take advantage of that fact and build an appropriate rule for Postini or your mailer. But beware of throwing out the baby with the bathwater! In other words, don’t subject your users to this rule unless you either have a good quarantine or they are sure they should never receive this type of email.

There are some spam types which evade the gb2312 rule mentioned above, however. And this part is not as well tested, frankly. The exceptions, which are still a minority of my Chinese spam, are characterized by a subject line or sender that contains =?utf-8?B?5… or =?utf-8?B?6… (see summary below). My honest expectation is that a rule this broad and coarse will also catch a few other languages (Japanese or Korean text can begin with bytes in the same range, for instance), so be careful! If you are expecting to get non-English email, more testing is in order before implementing the utf-8 filter. But it will certainly help to eliminate even more Chinese spam.

4/2013 update
Summary
My filter has worked very well for me and has withstood the test of time. I catch at least a dozen Chinese spams each day. None get through. I realize reading the above write-up is confusing because I’ve mixed my love of telling a good IT mystery with my desire to convey useful information. So, to summarize, the rule I have been using these last months is:

Match Any:

Header matches RegEx:
(charset="gb(k|2312)"|=\?GB(K|2312)\?)

Header matches RegEx:
=\?utf-8\?B\?[56]


11 Responses to How to Stop Chinese Spam – for Mail Admins, w/ Sep 2012 update

  1. Justin says:

    Excellent advice got at 2nd position in Google search for my query “filter chinese language email”.

    I have several websites and am subjected to this Chinese email flood terrorism day in and out. Armed with your insight, I could block everything that smelt Chinese.

    “throwing out the baby with the bathwater” – I don’t think so! None of my websites is supposed to receive email in Chinese – if some genuine Chinese user wants to deal with me, they should learn MY language (English), and write in that!

    Note for cpanel users:

    If you use cpanel, you must have already figured out how to do this. If you could not you shouldn’t be using cpanel ;)

    Anyway, here’s what you need to do.
    Go to “Email Filtering”.
    Create a New Filter.

    Filter Name: chinese_spam (or whatever you like)

    Rules (in the 3 boxes):
    Any header / matches regex / gb2312
    or
    Any header / matches regex / gbk

    Actions:
    Discard Message (or whatever you deem fit).

  2. JO says:

    I dont have an open mind. I want to delete every email I get that has chinese characters in it. Tired of getting dozens of spam emails from them every single day. They waste my time.

  3. Phil says:

    do I need the brackets in Postini? My regex statement just looks like this:

    big5|euc-cn|euc-tw|iso-2022-cn|gb2312

    Does this look correct?

    • clockworkdiamond says:

      I’m working on a similar Chinese spam issue. Did your filter work as you had it when you posted it?

  4. tony says:

    This is not working for my test email. I have tried all the possible solutions from the blog and from the comments with an OR condition, but my test email still went through. Any idea?
    My filters:
    Any header
    Matches Regex
    (charset="gb(k|2312)"|=\?GB(K|2312)\?)

    OR

    Any header
    Matches Regex
    =\?utf-8\?B\?[56]

    OR

    Any header
    Matches Regex
    gb2312

    OR

    Any header
    Matches Regex
    gbk

    OR

    Any header
    Matches Regex
    big5|euc-cn|euc-tw|iso-2022-cn|gb2312

    • Doron Offir says:

      This regEx seems to cover both, worked for me.
      ((charset="gb(k|2312)"|=\?GB(K|2312)\?)|(=\?utf-8\?B\?[56]))

  5. Steve Rawlinson says:

    This is great until the chinese spammer uses quoted printable instead of base 64. The start of the subject line looks like this: =?utf-8?Q? and the hex values are written with a preceding equals sign, eg =E4=B8=8A.

    It turns out that the unicode for chinese requires three octets and that means the first octet must start '1110' in binary. The minimum value of the first octet is therefore E0 in hex. One-byte encodings start with 0 (max value 7F) and two-byte encodings must start '110' (max value DF) so if you see the string '=E' you know you’re dealing with three-byte encoding which is either Chinese or Hangul (whatever that is).

    So this regex should work: =\?utf-8\?Q\?.*=E

  6. Richard says:

    Most helpful blog. Thanks very much to everyone that offered solutions. A real time saver!

  7. MK says:

    Perfect article. Thank u very much for that one

  8. Andy says:

    A little late to the party, but should anyone arrive here looking for a mail client solution, e.g. Mac Mail, I have been using a naive (but works well) approach – just copy and paste a common Chinese character into the message rules for the content of mail and delete/junk/move it :)

    If you’re looking for some candidate characters, try a few of these:
    http://www.zein.se/patrick/3000char.html
