SpamAssassin: matching Han/Chinese characters

I'm trying to implement a rule that matches all Chinese characters (Han) with

SpamAssassin version 3.3.1 running on Perl version 5.10.1

So far I tried the following rules:

body SPAM44 /\p{Han}/
body SPAM44 /[\x{4e00}-\x{9FFF}]/
body SPAM44 /[一-俿倀-忿怀-濿瀀-翿耀-迿退-龥]+/

The first two rules don't match anything. The last rule matches nearly all my mail. All three rules work fine on regex101.com, so this is probably a SpamAssassin-specific issue.

Example body that should be matched:

--_000_7f25887479e34b8585663e5702f9ae87companyde_
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

6L2m6Lqr5Yi26YCg5bel6Im65Y+K6KOF5aSH44CB5rG96L2m5pW06L2m6K6+6K6h5byA5Y+R5LiO
6K+V5Yi244CB5rG96L2m5bel56iL5LiO5pyN5Yqh44CB5pm66IO95Lqn57q/54mp5rWB5oqA5pyv
44CB5raC6KOF55Sf5Lqn57q/5Y+K6KOF5aSH44CB5bel5Lia5py65Zmo5Lq65oiQ5aWX5oqA5pyv
5Y+K6KOF5aSH44CB5bqV55uY5Yi26YCg5bel6Im65Y+K6KOF5aSHDQoNCg0KDQoN

I cannot post the decoded string, because Stack Overflow says it's spam.

So how do I match Chinese characters with SpamAssassin?

Solution

SpamAssassin doesn't normalize the character set to Unicode unless you set normalize_charset 1 in your local configuration (the default is 0). Without normalization, the rules see the body as bytes in whatever character set the message used, so there is virtually zero chance that these regular expressions will match.
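
To illustrate the difference outside SpamAssassin, here is a minimal Python sketch. It is only an analogy, not SpamAssassin's actual matching code: it assumes that un-normalized text behaves like a string in which every byte is its own character, which the sketch imitates by decoding as Latin-1. The names `raw`, `han`, `as_bytes` and `as_text` are made up for the example.

```python
import re

# UTF-8 bytes of U+8F66 U+8EAB, the first two characters of the sample body.
raw = b"\xe8\xbd\xa6\xe8\xba\xab"

# Code-point range equivalent to the question's second rule.
han = re.compile("[\u4e00-\u9fff]")

# One character per byte: roughly what a rule sees without normalization.
as_bytes = raw.decode("latin-1")

# Properly decoded text: what a rule would need in order to match code points.
as_text = raw.decode("utf-8")

print(han.search(as_bytes) is not None)
print(han.search(as_text) is not None)
```

In the byte-per-character view every character is below U+0100, so a range starting at U+4E00 has nothing to match.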

Without normalize_charset, to match a Chinese character in UTF-8, your regex needs to match the UTF-8 byte sequence that encodes the character, not its decoded Unicode code point.

body  SPAM44_UTF8 /[\xe4-\xe9][\x80-\xbf][\x80-\xbf]/
score SPAM44_UTF8 2

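(I'm not entirely sure about the regex, but you get the idea. As written it matches the three-byte UTF-8 sequences for U+4000 through U+9FFF, which is slightly wider than the U+4E00 to U+9FFF range in your second rule.)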
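
As a rough check of the byte-level pattern, the following Python sketch decodes the first base64 line of the sample body from the question and applies the same character classes to the resulting bytes. It only exercises the regex itself; it does not reproduce how SpamAssassin decodes or renders a body, and the variable names are hypothetical.

```python
import base64
import re

# First base64 line of the sample body quoted in the question.
sample_b64 = (
    "6L2m6Lqr5Yi26YCg5bel6Im65Y+K6KOF5aSH44CB5rG96L2m5pW06L2m6K6+6K6h5byA5Y+R5LiO"
)
body = base64.b64decode(sample_b64)

# Same byte classes as the SPAM44_UTF8 rule, compiled as a bytes pattern.
utf8_han = re.compile(rb"[\xe4-\xe9][\x80-\xbf][\x80-\xbf]")

first = utf8_han.search(body)
hits = utf8_han.findall(body)

print(first.group() if first else None)
print(len(hits), "three-byte sequences matched in", len(body), "bytes")
```

The decoded line also contains U+3001 (an ideographic comma, bytes E3 80 81), which the pattern skips because its lead byte is below E4.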

This obviously only works for bodies in UTF-8, so you would need to write a similar rule for any other character set you want to handle (GB2312, perhaps). It might also produce false positives for message bodies which aren't actually UTF-8, though that risk seems rather marginal.

This rule matches a single Chinese character anywhere. Maybe you'll want to extend it to look for a sequence of, say, four or more to reduce the risk of false positives.
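
One way to require a run is to wrap the three-byte pattern in a non-capturing group with a `{4,}` quantifier. The Python sketch below tries that idea on two made-up byte strings. The threshold of four comes from the suggestion above, the names are hypothetical, and the same grouping should carry over to the Perl regex in the rule, though it is untested there.

```python
import re

# One Han character as a three-byte UTF-8 sequence (approximate range).
single = rb"[\xe4-\xe9][\x80-\xbf][\x80-\xbf]"

# Four or more such sequences in a row.
run_of_four = re.compile(rb"(?:" + single + rb"){4,}")

# Two characters repeated twice: four consecutive Han characters.
four_chars = b"\xe8\xbd\xa6\xe8\xba\xab" * 2

# Mostly ASCII text containing a single Han character.
one_char = b"price of one \xe8\xbd\xa6 today"

print(run_of_four.search(four_chars) is not None)
print(run_of_four.search(one_char) is not None)
```

A message that quotes one stray character then no longer triggers the rule, while a body written in Chinese still does.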

Perhaps normalize_charset 1 will become the default one day, but with the current state of email, I don't think that will be feasible any time soon. There are simply too many cases where the character set information is missing or incorrect, and heuristics to fix it automatically are brittle and error-prone.