(Updated 12/19/2011 and 6/2014 with additional character sets)
(updated 9/2012 with additional signature)
Intro
I have been a target for random Chinese language spam in my various email accounts, but the problem has really gotten worse in the past few months.
The thing about these messages is that at first Postini (a Google spam filtering service used mostly by businesses) wasn’t very good at catching them. Postini is about the best in the business, and they’re competently catching just about every other type of spam. But these Chinese character messages kept slipping through…
Their support tech gave me some advice which turned out to be incorrect, but led me in the right direction. He told me to create a content manager rule, but the actual rule he provided was only going to catch Russian and Ukrainian spam!
This is the rule he provided:
Rule Name: Non_English_spam
"Match Any"
Header - matches regex koi8-r|koi8-u|koi7|koi8
Disposition: delete (blackhole)
Set quarantine to Recipient
I had no idea what that was doing, so I looked up koi8-r, koi8, etc and found that it had to do with the Cyrillic alphabet. So I wondered if the Chinese language spams have something similar, but for Chinese. Indeed they do: gb2312. Looking at a few of my Chinese spams, almost all contain this string in the headers. It’s not always in the exact same place, but it’s there. To be concrete, here’s an example (some headers have been obfuscated to prevent the bad guys from trying to reverse engineer Postini’s scoring algorithms):
Received: from websmtp.sohu.com ([61.135.132.136]) by eu1sys200amx108.postini.com ([207.126.147.10]) with SMTP; Sun, 28 Aug 2011 18:41:21 GMT
Received: from omlbw (unknown [110.53.27.141]) by websmtp.sohu.com (Postfix) with ESMTPA id 9B3C6720CEA; Sun, 28 Aug 2011 23:55:04 +0800 (CST)
Message-ID: <[email protected]>
From: =?gb2312?B?y7O1wsf4xu/A1rbguabE3NfU0NCztdPQz965q8u+?= <[email protected]>
To:
Subject: =?gb2312?B?d3Azz/ogytsg1vcgudwg1/Yg0KkgIMqyIMO0IA==?=
 =?gb2312?B?uaQg1/cgssUgxNwgzOEgIMn9INK1ILyoIKO/LS0=?=
 =?gb2312?B?qIk=?=
Date: Sun, 28 Aug 2011 23:55:37 +0800
MIME-Version: 1.0
X-mailer: Lzke 2
X-SOHU-Antispam-Bayes: 0
X-pstn-levels: omitted
X-pstn-settings: omitted
X-pstn-addresses: from <[email protected]> [49/2]
Content-Type: multipart/mixed; boundary="----=_NextPart_000_015A_013AC9FA.1A2D5A60"

------=_NextPart_000_015A_013AC9FA.1A2D5A60
Content-Transfer-Encoding: base64
Content-Type: text/html; charset="gb2312"
See it? charset="gb2312" appears in the Content-Type header and =?gb2312? appears in both the Subject and From fields.
That message looks like this as displayed in my mail client:
How do I know this is Chinese? I pasted the characters into translate.google.com and it auto-detected it. That’s a convenient tool!
How do I know it is spam? I am open-minded. Perhaps it is a legitimate business proposition that just happens to be written in Chinese? It does sort of read that way from the translation of any one such message. On the other side are some stronger pieces of evidence. The empty To: header is a strong hint, but some legitimate messages could share that undesirable feature, so it is an indicator and not definitive. Most important is that I get many of these messages, all showing similar patterns in appearance and, most telling of all, each coming from a different sender. That tells me unambiguously that this is really, truly spam.
So the actual Postini Content Manager rule to capture Chinese spam is this:
Rule Name: Chinese_spam
"Match Any"
Header matches regex (charset="gb2312"|=\?GB2312\?)
Disposition: delete (blackhole)
Set quarantine to Recipient
Obviously this type of rule is a bit dangerous. What if you are expecting something written in Chinese? It will be subject to the same treatment as the spam. That is why the suggestion is to Set quarantine to recipient so that these messages could be delivered from the user quarantine.
And over the course of a couple months Postini has gotten much better about capturing this type of spam. That is the best thing – to let the experts handle it. They just needed to train their algorithms. I was quite concerned at first that this spam is so different from the usual, recognizable spam campaigns that they might have a hard time spotting it while simultaneously allowing the good Chinese email through. But they’re almost there…
12/19 Update
The filter described above has been working extremely well for me. Essentially perfectly, in fact, as I can see when I look in my quarantine. But not today. Today some suspected Chinese spam got in, and examining the headers showed something slightly different. The subject looks like this:
Subject: =?GBK?B?bnZ2dyAyMDExLjEyLTIwMTItMDEgvqsgxrcgzcYgz/ogIGZkZXI=?=
And the MIME header also had that string:
Content-Type: text/plain; charset=GBK
Look up the GBK character set and you’ll immediately see it is an extension of the simplified Chinese set. So I think we had better add that character set to our expression. It makes our content manager rule only a little more complicated. Now we would have:
Rule Name: Chinese_spam
"Match Any"
Header matches regex (charset="gb(k|2312)"|=\?GB(K|2312)\?)
Disposition: delete (blackhole)
Set quarantine to Recipient
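To see how this expression behaves against the sample headers, here is a minimal Python sketch. It assumes the Content Manager compares text case-insensitively (the rule mixes gb and GB, and how Postini treats case is not confirmed here), so re.IGNORECASE stands in for that. The names CHINESE_CHARSET and header_matches are made up for illustration.

```python
import re

# The rule's expression, copied from above. IGNORECASE is an assumption
# about how the Content Manager compares text, not documented behavior.
CHINESE_CHARSET = re.compile(r'(charset="gb(k|2312)"|=\?GB(K|2312)\?)', re.IGNORECASE)

def header_matches(header_line):
    """Return True if a single header line matches the expression."""
    return CHINESE_CHARSET.search(header_line) is not None

subject = "Subject: =?GBK?B?bnZ2dyAyMDExLjEyLTIwMTItMDEgvqsgxrcgzcYgz/ogIGZkZXI=?="
quoted = 'Content-Type: text/html; charset="gb2312"'
unquoted = "Content-Type: text/plain; charset=GBK"

print(header_matches(subject))   # True: encoded-word form
print(header_matches(quoted))    # True: quoted charset parameter
print(header_matches(unquoted))  # False: the first alternative requires the quotes
```

The unquoted charset=GBK line is not matched on its own. The sample message is still caught, because its Subject header matches and the rule is Match Any. If you want the Content-Type header to count by itself, writing each quote as "? should cover both forms.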
For the complete prescription see the summary in the Conclusion.
If you happened upon this article and don’t have the Postini service, is there any relevance? Yes, I think so. You should be able to filter on the message headers to look for the string =?gb2312? or =?gbk? at the beginning of the subject line. To speak about mailers with which I have some experience, in sendmail you could do this with a milter. In PureMessage it would be possible to concoct an appropriate rule as well.
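For a mailer-independent idea of what such a check involves, here is a rough Python sketch using the standard email module. It only covers the test itself; how you would hook it into a milter or a PureMessage rule is not shown. The function name, the shortened sample message, and the sender@example.com address are all invented for illustration.

```python
import re
from email import message_from_string

# Encoded-word prefix for the two character sets discussed so far.
ENCODED_WORD = re.compile(r'=\?(gb2312|gbk)\?', re.IGNORECASE)

def looks_like_chinese_spam(raw_message):
    """Check the raw Subject and From headers for a gb2312/GBK encoded-word."""
    msg = message_from_string(raw_message)
    for name in ("Subject", "From"):
        value = msg.get(name, "")
        if ENCODED_WORD.search(str(value)):
            return True
    return False

# Hypothetical, shortened message modeled on the example headers above.
sample = (
    "From: =?gb2312?B?y7O1wsf4?= <sender@example.com>\n"
    "To:\n"
    "Subject: =?gb2312?B?qIk=?=\n"
    "\n"
    "body\n"
)
print(looks_like_chinese_spam(sample))                      # True
print(looks_like_chinese_spam("Subject: hello\n\nbody\n"))  # False
```

As with the Postini rule, a match is better routed to a quarantine than deleted outright.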
9/2012 Update
My filter was working so well these past few months I essentially forgot about the problem, but the occasional Chinese spam slipped through. How? It used a different encoding. Here is an example subject line:
Subject: =?utf-8?B?6K+35p+l5pS277yB?=
This is displayed by my mail client as three Chinese characters followed by “!” This one drove me to do a little research. This is an Encoded-Word, according to Wikipedia’s excellent MIME writeup. The “?B?” in the front means base64 encoding. I had previously written a mimedecoder in Perl, which I put to use:
> mimedecode 6K+35p+l5pS277yB
which produces:
???!
which is pretty much garbage. So I decided to analyze the output with the Unix utility od:
> mimedecode 6K+35p+l5pS277yB|od -x
which gives
0000000 e8af b7e6 9fa5 e694 b6ef bc81
Next, I needed a UTF-8 converter, which I found at this Swiss site.
I used it with input type hexadecimal.
The results reproduced exactly the Chinese characters my mail client displayed to me! It also gives a lot of other descriptions for these characters (such as Cangjie). The first few lines begin:
As character names:
U+8BF7 CJK UNIFIED IDEOGRAPH character (请)
U+67E5 CJK UNIFIED IDEOGRAPH character (查)
U+6536 CJK UNIFIED IDEOGRAPH character (收)
U+FF01 FULLWIDTH EXCLAMATION MARK character (！)
As raw characters:
请查收！
…
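The same decoding can be reproduced without the web converter. Here is a minimal Python sketch using only the standard library: it base64-decodes the payload, shows the bytes that od displayed, and lists the code points. Note that Python names ideographs by code point (for example CJK UNIFIED IDEOGRAPH-8BF7), so the wording differs slightly from the converter's.

```python
import base64
import unicodedata

encoded = "6K+35p+l5pS277yB"  # payload of the encoded-word from the Subject line

raw = base64.b64decode(encoded)
print(raw.hex())              # e8afb7e69fa5e694b6efbc81, the bytes od displayed

# Interpret the bytes as UTF-8; text now holds the four characters
# that the mail client displayed.
text = raw.decode("utf-8")
for ch in text:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The loop prints U+8BF7, U+67E5, U+6536 and U+FF01, matching the converter's list.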
Well, that was an interesting exercise, but I’m not sure we’ve learned anything that can be put to use in a RegEx on the original expression. Unless there’s a way to uniquely identify Chinese characters by the beginning of the encoded-word sequence following the ?B?. I have my doubts, but since I don’t seem to get these UTF-8 emails from other sources, and I have a sample size of about five emails that fooled the other filter to work with, I have developed a content filter which would capture all of them!
Check for a header containing the RegEx:
=\?utf-8\?B\?[56]
More specifically, sometimes the utf-8 string is used in the From header and sometimes in the subject. Most of my samples would have been caught by the simpler RegEx =\?utf-8\?B\?5, and I mention that in case you want to be more specific, but there was one recent one that had a “6” instead of a “5.”
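There is a reason behind the 5 and the 6. The first base64 character carries the top six bits of the first byte of the encoded text, and in UTF-8 the main CJK ideograph block (U+4E00 through U+9FFF) always starts with a byte between 0xE4 and 0xE9. The Python sketch below works out which first bytes produce a 5 or a 6. The helper names are invented for illustration.

```python
import base64

def first_base64_char(ch):
    """First base64 character of a single character's UTF-8 encoding."""
    return base64.b64encode(ch.encode("utf-8")).decode("ascii")[0]

def lead_bytes_for(b64_char):
    """All first-byte values whose top six bits map to the given base64 character."""
    return [b for b in range(256)
            if base64.b64encode(bytes([b])).decode("ascii")[0] == b64_char]

print([hex(b) for b in lead_bytes_for("5")])  # ['0xe4', '0xe5', '0xe6', '0xe7']
print([hex(b) for b in lead_bytes_for("6")])  # ['0xe8', '0xe9', '0xea', '0xeb']
print(first_base64_char("\u8bf7"))  # 6: U+8BF7, first character of the sample subject
print(first_base64_char("\u00e9"))  # w: U+00E9, a Latin letter with an acute accent
```

So [56] amounts to saying that the first character of the encoded text lies between U+4000 and U+BFFF. That covers Chinese ideographs, but the same range also holds Japanese kanji and part of the Korean Hangul block, while accented Latin letters start with other base64 characters.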
For the record, here’s that mimedecode “program”:
#!/usr/bin/perl
# base64 MIME decoding
# example:
#   mimedecode Nz84QGxhdGU=
#   => 7?8@late
use MIME::Base64;
foreach (@ARGV) {
    # $encoded = encode_base64($_);
    $decoded = decode_base64($_);
    #print "enc,dec: $encoded, $decoded\n";
    print $decoded;
}
And its sister program, which I call mimeencode:
#!/usr/bin/perl
# base64 MIME encoding
# DrJ, 6/2004
# example:
#   mimeencode 7?8@late
#   => Nz84QGxhdGU=
use MIME::Base64;
foreach (@ARGV) {
    $encoded = encode_base64($_);
    # $decoded = decode_base64($_);
    #print "enc,dec: $encoded, $decoded\n";
    print $encoded;
}
There’s probably a built-in Linux utility which does the same thing; I just don’t know what it is.
2022(!) update
Well, I finally ran across it. The built-in program to do mimeencode/mimedecode is base64. Oh well, better late than never…
Conclusion
Your users needn’t suffer from Chinese spam. The vast majority are characterized by, um, Chinese characters, of course, whose presence is almost always indicated by the string gb2312 in the message headers. You can take advantage of that fact and build an appropriate rule for Postini or your mailer. But beware of throwing out the baby with the bathwater! In other words, only subject your users to this rule if you either have a good quarantine or they are sure they should never receive this type of email.
There are some spam types which evade the gb2312 rule mentioned above, however. And this part is not as well tested, frankly. The exceptions, which are still a minority of my Chinese spam, are characterized by a subject line or sender that contains =?utf-8?B?5… or =?utf-8?B?6… (see summary below). My honest expectation is that a rule this broad and coarse will also catch a few other languages (Japanese and Korean are likely candidates, since their characters can begin with the same UTF-8 bytes), so be careful! If you are expecting to get non-English email, more testing is in order before implementing the utf-8 filter. But it will certainly help to eliminate even more Chinese spam.
4/2013 update
Summary, including 6/2014 update
My filter has worked very well for me and has withstood the test of time. I catch at least a dozen Chinese spams each day. One got through in 6/2014 however, with character set gb18030. I realize reading the above write-up is confusing because I’ve mixed my love of telling a good IT mystery with my desire to convey useful information. So, to summarize, the new combined rule is:
Match Any:
Header matches RegEx:
(charset="gb(k|2312|18030)"|=\?GB(K|2312|18030)\?)
Header matches RegEx:
=\?utf-8\?B\?[56]
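For anyone who wants to try the combined rule outside Postini, here is a small Python sketch of the Match Any logic. It is an approximation under stated assumptions: re.IGNORECASE is a guess at how the Content Manager treats case, the gb18030 header line is invented (the real one from 6/2014 is not reproduced in this post), and the names RULES and match_any are made up.

```python
import re

# The two expressions from the summary. IGNORECASE is an assumption,
# not documented Content Manager behavior.
RULES = [
    re.compile(r'(charset="gb(k|2312|18030)"|=\?GB(K|2312|18030)\?)', re.IGNORECASE),
    re.compile(r'=\?utf-8\?B\?[56]', re.IGNORECASE),
]

def match_any(header_lines):
    """'Match Any': True if any header line matches any of the expressions."""
    return any(rule.search(line) for line in header_lines for rule in RULES)

print(match_any(["Subject: =?utf-8?B?6K+35p+l5pS277yB?="]))      # True
print(match_any(['Content-Type: text/html; charset="gb18030"']))  # True (invented header)
print(match_any(["Subject: Lunch on Friday?"]))                   # False
```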
References
A spate of spam from enom-registered domains is described here.
A disappointing case where Google is not operating their Gmail service as a white-glove service is described here.