Mailer, Charsets and Spam
Saturday, October 7th, 2006
I took a further look at charsets in mail, with respect to what I need to receive and what spammers use. In theory, a mailer should always use the smallest charset that suffices: us-ascii, unless the user types a non-ASCII character, in which case it should use the configured ISO-8859 charset if that covers the character, or UTF-8 otherwise. If you configured it to use, say, ISO-8859-1 and you type an umlaut, it sets the charset to ISO-8859-1; if you type a Cyrillic character, it should use UTF-8. If you configured ISO-8859-5, it should use that for Cyrillic, and UTF-8 if you type an umlaut. Simple. So you only need us-ascii, ISO-8859 and UTF-8.
Now there are some braindead and/or obsolete mail programs which use different, outdated charsets. As it happens, you will be in contact with people whose mails arrive in a hodge-podge of non-standardized charsets, most notably windows-125X. For a Western European, German-speaking context, where most mails are either German or English with a little French or Spanish thrown in, I did some statistics on the charsets of spam and ham.
This is a sample of 1220 legitimate mails:
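As a rough illustration of the selection rule described above, here is a minimal Python sketch. The function name pick_charset and its configured parameter are made up for this example; real mailers implement the choice in their own way.

```python
def pick_charset(text, configured="iso-8859-1"):
    """Return the smallest charset that can encode text.

    Hypothetical helper: tries us-ascii first, then the user's configured
    ISO-8859 variant, and falls back to utf-8 if neither can encode the text.
    """
    for charset in ("us-ascii", configured):
        try:
            text.encode(charset)
            return charset
        except UnicodeEncodeError:
            continue  # this charset cannot represent the text, try the next
    return "utf-8"


if __name__ == "__main__":
    print(pick_charset("Hello"))                            # us-ascii
    print(pick_charset("Grüße"))                            # iso-8859-1
    print(pick_charset("Привет"))                           # utf-8
    print(pick_charset("Привет", configured="iso-8859-5"))  # iso-8859-5
    print(pick_charset("Grüße", configured="iso-8859-5"))   # utf-8
```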
83.6% iso-8859-1
5.3% us-ascii
4.6% utf-8
4.5% iso-8859-15
1.5% windows-1252
The remaining half percent is negligible and consists of some other ISO-8859 charsets.
With spam, this looks quite different. The sample here is 6251 spam mails:
15.5% iso-8859-1
39.5% us-ascii
0.6% utf-8
0.1% iso-8859-15
14.7% windows-1252
Now, where’s the rest? It’s mostly a huge amount of windows-1250, the Windows us-ascii replacement, which is a completely superfluous charset:
23.9% windows-1250
1.5% iso-2022-jp
1.3% koi8-r
1.1% iso-8859-2
0.5% windows-1255
0.3% windows-1254
The remaining one percent consists of other Japanese, Chinese and Russian charsets, plus the rest of the windows-125X charsets.
So now I can get rid of about 25 percent of spam by not allowing Chinese and other East Asian charsets, Russian koi8 and the windows-125X charsets other than windows-1252. I could get rid of another 14.7% by blocking windows-1252 as well, but that would piss off 1.5% of my legitimate contacts.
The blocking can be accomplished by simply putting a rule into .procmailrc:
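Tallies like the ones above can be produced with a few lines of Python. This is only a sketch of one possible approach, not necessarily how the numbers in this post were gathered: the helper names first_charset and charset_shares are invented, and it assumes the mail is stored in an mbox file whose path is passed on the command line.

```python
import mailbox
import sys
from collections import Counter


def first_charset(msg):
    """Return the first charset declared in any MIME part (lowercased)."""
    for part in msg.walk():
        charset = part.get_content_charset()
        if charset:
            return charset
    return "(none)"


def charset_shares(messages):
    """Map each charset to its share of the messages, in percent."""
    counts = Counter(first_charset(m) for m in messages)
    total = sum(counts.values())
    return {cs: round(100.0 * n / total, 1) for cs, n in counts.most_common()}


if __name__ == "__main__" and len(sys.argv) > 1:
    # sys.argv[1] is assumed to be an mbox file, e.g. a ham or spam folder.
    for charset, share in charset_shares(mailbox.mbox(sys.argv[1])).items():
        print(f"{share:5.1f}% {charset}")
```

Messages that declare no charset at all are counted as "(none)" here; whether to lump them in with us-ascii is a judgment call.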
:0:
* ^Content-Type.*windows-1250.*
spam
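Or by giving it a score in SpamAssassin:
# one sub-rule per unwanted charset; __ILLEGAL_CHARSET_2 was not shown
# originally, windows-1251 is only an assumed example
header __ILLEGAL_CHARSET_1 Content-Type =~ /windows-1250/i
header __ILLEGAL_CHARSET_2 Content-Type =~ /windows-1251/i
meta ILLEGAL_CHARSETS (__ILLEGAL_CHARSET_1 + __ILLEGAL_CHARSET_2 >= 1)
score ILLEGAL_CHARSETS 3
describe ILLEGAL_CHARSETS foreign obsolete charsets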
Or by refusing it completely during transmission in Postfix’s /etc/postfix/headercheck.pcre:
# continuation lines must start with whitespace; the x flag makes PCRE
# ignore the whitespace inside the pattern
/^Content-Type:.*\bcharset="?(?:
  windows-1250 |
  windows-1251 |
  windows-1253 |
  windows-1254 |
  windows-1255 |
  windows-1256 |
  windows-1257 |
  windows-1258 |
  windows-874)\b/x REJECT Illegal Charset
/etc/postfix/headercheck.pcre can also be used to block very selectively, for instance if an “unsubscribe” does not get honoured (this is of course useless against normal spammers, since they constantly change addresses):
/^From: .*test@example\.com/ REJECT I said NO, you spamming moron
And if you want to test a rule, save an email as “testcase” or something and check if it matches:
postmap -q - pcre:/etc/postfix/headercheck.pcre < testcase
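For quick experiments on a machine without Postfix, the charset pattern can be approximated in Python. This is only a sketch under some assumptions: Python's re module is not PCRE (the pattern here uses only syntax both share), the flags are chosen to mimic what I understand to be the pcre table defaults (case-insensitive, dot matches newline) plus x, and the name matching_header is invented. Only the top-level headers are checked, so postmap remains the authoritative test.

```python
import re
import sys
from email import message_from_binary_file

# Same alternatives as in the Postfix rule above.
BLOCKED = re.compile(
    r'''^Content-Type:.*\bcharset="?(?:
        windows-1250 |
        windows-1251 |
        windows-1253 |
        windows-1254 |
        windows-1255 |
        windows-1256 |
        windows-1257 |
        windows-1258 |
        windows-874)\b''',
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)


def matching_header(msg):
    """Return the first top-level header line the pattern matches, or None."""
    for name, value in msg.items():
        line = f"{name}: {value}"
        if BLOCKED.search(line):
            return line
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    # sys.argv[1] is the saved mail, e.g. the "testcase" file from above.
    with open(sys.argv[1], "rb") as fh:
        hit = matching_header(message_from_binary_file(fh))
    print("REJECT Illegal Charset" if hit else "no match")
```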