{"id":20,"date":"2006-10-07T12:12:27","date_gmt":"2006-10-07T11:12:27","guid":{"rendered":"\/Blog\/?p=20"},"modified":"2009-10-16T11:57:59","modified_gmt":"2009-10-16T10:57:59","slug":"mailer-charsets-and-spam","status":"publish","type":"post","link":"https:\/\/seegras.discordia.ch\/Blog\/mailer-charsets-and-spam\/","title":{"rendered":"Mailer, Charsets and Spam"},"content":{"rendered":"<p>I took a further look at charsets in mail, with respect to what I need to receive and what spammers use. In theory, a mailer should always use the least necessary charset, us-ascii that is, unless the user types some non-ascii character, in which case it should use the ISO-8859 charset if appropriate, or UTF-8. If you configured it to use, say, ISO-8859-1 and you type an umlaut, it sets the charset to ISO-8859-1; if you type a Cyrillic character, it should use UTF-8. If you configured ISO-8859-5, it should use this for Cyrillic, and UTF-8 if you type an umlaut. Simple. So you only need us-ascii, ISO-8859 and UTF-8. <\/p>\n<p>Now there are some braindead and\/or obsolete mail programs which use different and outdated charsets. As it happens, you will be in contact with people whose mails appear in a hodge-podge of non-standardized charsets, most notably windows-125X. For a Western European, German-speaking context, where most mails are either German or English, with a little French or Spanish thrown in, I did some statistics regarding the charsets of spam and ham. <\/p>\n<p>This is a sample of 1220 legitimate mails: <\/p>\n<blockquote><p>\n83.6% iso-8859-1<br \/>\n5.3% us-ascii<br \/>\n4.6% utf-8<br \/>\n4.5% iso-8859-15<br \/>\n1.5% windows-1252\n<\/p><\/blockquote>\n<p>The remaining half percent is negligible and consists of some other iso-8859 charsets. 
<\/p>\n<p>With spam, this looks quite different. The sample here is 6251 spam mails:<\/p>\n<blockquote><p>\n15.5% iso-8859-1<br \/>\n39.5% us-ascii<br \/>\n0.6% utf-8<br \/>\n0.1% iso-8859-15<br \/>\n14.7% windows-1252\n<\/p><\/blockquote>\n<p>Now, where&#8217;s the rest? It&#8217;s mostly a huge amount of windows-1250, the Windows code page for central European languages, which is a completely superfluous charset here:<\/p>\n<blockquote><p>\n23.9% windows-1250<br \/>\n1.5% iso-2022-jp<br \/>\n1.3% koi8-r<br \/>\n1.1% iso-8859-2<br \/>\n0.5% windows-1255<br \/>\n0.3% windows-1254\n<\/p><\/blockquote>\n<p>The remaining percent consists of other Japanese, Chinese and Russian charsets, plus the remaining windows-125X charsets. <\/p>\n<p>So now I can get rid of about 25 percent of spam by not allowing Chinese and other East Asian charsets, Russian koi8 and most windows-125X charsets except windows-1252. I could get rid of another 14.7% by blocking windows-1252 as well, but that would piss off 1.5% of my legitimate contacts. <\/p>\n<p>The blocking can be accomplished by simply putting a rule into .procmailrc:<br \/>\n<code><br \/>\n:0:<br \/>\n* ^Content-Type.*windows-1250.*<br \/>\nspam<br \/>\n<\/code><\/p>\n<p>Or by giving it a score in SpamAssassin:<br \/>\n<code><br \/>\n# one sub-rule per unwanted charset (koi8-r is just a second example)<br \/>\nheader __ILLEGAL_CHARSET_1  Content-Type =~ \/windows-1250\/i<br \/>\nheader __ILLEGAL_CHARSET_2  Content-Type =~ \/koi8-r\/i<br \/>\nmeta ILLEGAL_CHARSETS (__ILLEGAL_CHARSET_1 + __ILLEGAL_CHARSET_2 >= 1)<br \/>\nscore  ILLEGAL_CHARSETS 3<br \/>\ndescribe ILLEGAL_CHARSETS foreign obsolete charsets<br \/>\n<\/code><\/p>\n<p>Or by refusing it completely during transmission, in Postfix&#8217;s \/etc\/postfix\/headercheck.pcre:<br \/>\n<code><br \/>\n\/^Content-Type:.*\\bcharset=\"?(?:<br \/>\n        windows-1250    |<br \/>\n        windows-1251    |<br \/>\n        windows-1253    |<br \/>\n        windows-1254    |<br \/>\n        windows-1255    |<br \/>\n        windows-1256    |<br \/>\n        windows-1257    |<br \/>\n        windows-1258    |<br \/>\n        windows-874)\\b\/x  REJECT Illegal 
Charset<br \/>\n<\/code><\/p>\n<p>\/etc\/postfix\/headercheck.pcre can also be used to block very selectively, for instance if an &#8220;unsubscribe&#8221; is not honoured (this is of course useless against normal spammers, since they constantly change addresses):<br \/>\n<code><br \/>\n\/^From: .*test@example\\.com\/ REJECT I said NO, you spamming moron<br \/>\n<\/code><\/p>\n<p>And if you want to test a rule, save an email as &#8220;testcase&#8221; or something and check whether it matches:<br \/>\n<code><br \/>\npostmap -q - pcre:\/etc\/postfix\/headercheck.pcre &lt; testcase<br \/>\n<\/code><\/p>\n","protected":false},"excerpt":{"rendered":"<p>I took a further look at charsets in mail, with respect to what I need to receive and what spammers use. In theory, a mailer should always use the least necessary charset, us-ascii that is, unless the user types some non-ascii character, in which case it should use the ISO-8859 charset if appropriate, or UTF-8. [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[],"class_list":["post-20","post","type-post","status-publish","format-standard","hentry","category-computers"],"_links":{"self":[{"href":"https:\/\/seegras.discordia.ch\/Blog\/wp-json\/wp\/v2\/posts\/20","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/seegras.discordia.ch\/Blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/seegras.discordia.ch\/Blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/seegras.discordia.ch\/Blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/seegras.discordia.ch\/Blog\/wp-json\/wp\/v2\/comments?post=20"}],"version-history":[{"count":3,"href":"https:\/\/seegras.discordia.ch\/Blog\/wp-json\/wp\/v2\/posts\/20\/revisions"}],"predecessor-version":[{"id":122,"href":"https:\/\/seegras.discordia.ch\
/Blog\/wp-json\/wp\/v2\/posts\/20\/revisions\/122"}],"wp:attachment":[{"href":"https:\/\/seegras.discordia.ch\/Blog\/wp-json\/wp\/v2\/media?parent=20"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/seegras.discordia.ch\/Blog\/wp-json\/wp\/v2\/categories?post=20"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/seegras.discordia.ch\/Blog\/wp-json\/wp\/v2\/tags?post=20"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}