
How do I categorise non-English email using procmail and command-line tools?


I am subscribed to a mailing list where some of the messages are in languages other than English, which I cannot understand.

How do I filter the non-English messages to /dev/null using procmail and/or command-line tools?

I use procmail to filter my email, so ideally any alternative tool could also be called from a procmail recipe.

I'd prefer not to have to train my own language models.


Solution

  • One way is to use the Perl TextCat package from Gertjan van Noord.

    The text_cat script outputs the most likely language for the mail. This recipe assumes text_cat has been installed under /usr/local/bin.

    Here is a simple procmail recipe that calls the text_cat script. Procmail supplies the whole message on the standard input of any command run in backquotes:

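    :0
    * ^Subject.*Jobs.*Board
    {
        # Capture text_cat's guess; LANG_ avoids clobbering the locale variable LANG
        LANG_=`/usr/local/bin/text_cat`

        # Discard anything text_cat does not label exactly "english"
        :0
        * ! LANG_ ?? ^english$
        /dev/null

        # Everything else is saved in the jobs/ folder
        :0
        jobs/
    }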
    

    I've been running text_cat for a few years. In that time no non-English message has been classified as English, that is, there have been no false positives. I have not checked rigorously for false negatives, that is, English messages discarded as non-English.


    A second way, as mentioned by tripleee in a comment, is to use the language categorisation provided by SpamAssassin, which is also based on TextCat. SpamAssassin decodes any MIME transfer encodings before classifying the text, which the plain text_cat recipe above does not.

    Here is an only partially tested procmail recipe that filters on the SpamAssassin X-Spam-Languages header:

    :0
    * ^Subject.*Jobs.*Board
    {
        # File non-English emails in foreign/ using the SpamAssassin header
        # Matches when the header is not exactly "X-Spam-Languages: en"
        :0
        * !^X-Spam-Languages: en$
        foreign/
    
        # Save english language mails in folder
        :0
        jobs/
    }
    

    Warning: spamassassin will occasionally provide multiple language categorisations like so:

    X-Spam-Languages: en da ro
    

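    The recipe above does not account for this: its condition requires the header value to be exactly en, so such a message is filed in foreign/ even though en is one of the listed languages.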
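
    One way to handle the multi-language case is to treat a message as English whenever en appears anywhere in the header. The recipe below is a minimal, untested sketch of that idea. The "en anywhere in the list" policy is an assumption (you may prefer to require en to be listed first), and the folder names are simply the ones used in the recipe above.

    ```procmail
    :0
    * ^Subject.*Jobs.*Board
    {
        # Assumption: "en" anywhere in the language list counts as English.
        # (.* )? allows other codes before "en"; ( |$) requires "en" to be
        # followed by a space or the end of the line, so it matches only as
        # a whole code.
        :0
        * ! ^X-Spam-Languages:(.* )?en( |$)
        foreign/

        # Mail whose header lists en is saved in the jobs folder
        :0
        jobs/
    }
    ```

    As with the original recipe, a message that carries no X-Spam-Languages header at all satisfies the negated condition and ends up in foreign/.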

    Spamassassin Language Categorisation Configuration

    Edit /etc/spamassassin/v310.pre and uncomment the following line:

    loadplugin Mail::SpamAssassin::Plugin::TextCat
    

    Configure the plugin in /etc/spamassassin/local.cf:

    ok_languages en       # I understand english
    inactive_languages '' # Enable all languages
    add_header all Languages _LANGUAGES_
    # score UNWANTED_LANGUAGE_BODY 5 # Increase score - not necessary and not recommended 
    

    This recipe was incompletely tested with spamassassin version 3.4.2.


    To keep mail in a different language instead of English, substitute that language's name for english in the first recipe and its two-character language code for en in the second.
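
    As a concrete illustration of that substitution, the sketch below keeps only German mail. It is untested and assumes that text_cat reports German as german and that SpamAssassin uses the code de; check both against your own installation before relying on it.

    ```procmail
    :0
    * ^Subject.*Jobs.*Board
    {
        LANG_=`/usr/local/bin/text_cat`

        # Assumption: text_cat prints "german" for German text
        :0
        * ! LANG_ ?? ^german$
        /dev/null

        # SpamAssassin variant of the condition above (assumed code "de"):
        # * ! ^X-Spam-Languages: de$

        :0
        jobs/
    }
    ```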