python linux bash procmail

mail body encoding after procmail processing


I've got the following line in my .procmailrc on the SMTP server:

BODY=`formail -I ""`

Later I echo this body to a local file:

echo "$BODY" >> $HOME/$FILENAME; \

I've also tried printf (with the same result):

printf "$BODY" >> $HOME/$FILENAME; \

When I read this file I can see that the encoding has been changed. Here's what I get:

Administrator System=C3=B3w

while it should be (in Polish):

Administrator Systemów

How can I decode the body, either directly in .procmailrc or later (bash/python), to get the right string?

Another line in my .procmailrc works properly, but it needs an additional pipe through a Perl decoder:

SUBJECT=`formail -xSubject: | tr -d '\n' | sed -e 's/^ //' | /usr/bin/perl -MEncode -ne 'print encode ("utf8",decode ("MIME-Header",$_ )) '`

SUBJECT then contains UTF-8 characters and everything looks OK. Maybe there's a way to use a similar solution for the body of the mail?
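For comparison, the header decoding done by the Perl one-liner can be sketched in Python with the standard email.header module. This is only a rough equivalent, not a drop-in replacement for the recipe above; the function name decode_subject and the sample header value are made up for illustration.

```python
from email.header import decode_header, make_header


def decode_subject(raw_subject):
    """Decode an RFC 2047 encoded header value into a Unicode string."""
    # decode_header() splits the value into (data, charset) chunks;
    # make_header() reassembles them and str() yields the decoded text.
    return str(make_header(decode_header(raw_subject)))


# Hypothetical sample value: Q-encoded UTF-8, as a mail client might send it
sample = "=?UTF-8?Q?Administrator_System=C3=B3w?="
print(decode_subject(sample))
```

This handles only headers (encoded words). The body uses a different mechanism, described by the Content-Transfer-Encoding and Content-Type headers.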

OK. I finally got everything up and running. Here's what I did:

First the .procmailrc file:

VERBOSE=yes
LOGFILE=$HOME/procmail.log
:0f
* ^From.*(some_address@somedomain.com)
| $HOME/python_script.py

Now to the python_script.py:

#!/usr/bin/python

from email.parser import Parser
import sys

# Procmail pipes the complete message (headers and body) to stdin
message = Parser().parse(sys.stdin)

# Binary mode: get_payload(decode=True) returns the decoded bytes
temp_file = open("/home/(user)/file.txt", "wb")
temp_file.write(b"START\n")

if not message.is_multipart():
    temp_file.write(message.get_payload(decode=True))
else:
    for part in message.get_payload():
        if part.get_content_type() == 'text/plain':
            temp_file.write(part.get_payload(decode=True))

temp_file.close()

The most difficult part to debug was the .procmailrc recipe, where I had to test many options for :0, :0f, :0fbW etc... and finally found the one that suits best.

The next problematic step was decoding $BODY directly in .procmailrc. I solved that by dropping the shell processing altogether and moving everything into the Python script, just as tripleee suggested.


Solution

  • The encoding is not changed, but you are zapping the headers, so the correct Content-Type: header is no longer present (you should also keep Mime-Version: and any other standard Content-* headers).

    If you examine the source of the message in your mail client, you should see that neither Procmail nor Bash has actually changed anything. The text you receive is in fact literally Administrator System=C3=B3w. The MIME headers tell your email client that this is Content-Transfer-Encoding: quoted-printable and Content-Type: text/plain; charset="utf-8", and so it knows how to decode and display the text correctly.
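To make the two decoding steps concrete, here is a minimal sketch using the standard quopri module. It hard-codes what the headers would normally tell you (quoted-printable transfer encoding, UTF-8 charset), so treat those two choices as assumptions taken from the example above rather than something you can rely on for arbitrary mail.

```python
import quopri

# The body exactly as it travels over the wire
raw = b"Administrator System=C3=B3w"

# Step 1: undo the transfer encoding (quoted-printable -> raw bytes)
decoded_bytes = quopri.decodestring(raw)

# Step 2: interpret the bytes using the charset from Content-Type
text = decoded_bytes.decode("utf-8")
print(text)
```

=C3=B3 is simply the two UTF-8 bytes of the letter ó written out in hex.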

    If you want just the payload, you will need to decode it yourself, but in order to do that, you need this information from the MIME headers, so you should not kill them before you have handled the message (if at all). Something like this, perhaps:

    from email.parser import Parser
    import sys
    
    message = Parser().parse(sys.stdin)
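    # get_content_type() falls back to text/plain if the header is missing
    if message.get_content_type().startswith('text/'):
        # decode=True undoes quoted-printable/base64 and returns bytes
        payload = message.get_payload(decode=True)
        charset = message.get_content_charset() or 'utf-8'
        print(payload.decode(charset, 'replace'))
    else:
        sys.exit('unsupported content type: ' + message.get_content_type())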
    

    This is extremely simplistic in that it assumes (like your current, even more broken solution) that the message contains a single, textual part. Extending it to multipart messages is not technically hard, but how exactly you do that depends on what sort of multiparts you expect to receive, and what you want to do with the payload(s).

    As with your previous question, I would suggest that you move more, or all, of your email manipulation into Python if you are going to be using it anyway. Procmail has no explicit MIME support, so you would have to reinvent all of that in Procmail, which is neither simple nor particularly fruitful.
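As a rough illustration of the multipart case mentioned above, the sketch below walks every part of a message and collects the decoded text/plain parts. It is one possible policy, not the only sensible one: the function name extract_text, the UTF-8 fallback charset and the sample message are assumptions for the sake of the example, and it makes no attempt to skip text attachments or to handle text/html.

```python
from email import message_from_string


def extract_text(message):
    """Return the decoded text/plain parts of a message as one string."""
    chunks = []
    # walk() yields the message itself and, for multiparts, every subpart
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue
        # decode=True undoes the transfer encoding and returns bytes
        payload = part.get_payload(decode=True)
        # Assumed fallback when the part does not declare a charset
        charset = part.get_content_charset() or "utf-8"
        chunks.append(payload.decode(charset, "replace"))
    return "\n".join(chunks)


# Hypothetical single-part message, with the MIME headers left intact
SAMPLE = (
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Administrator System=C3=B3w\n"
)

body = extract_text(message_from_string(SAMPLE))
```

In a Procmail recipe you would parse sys.stdin instead of the sample string. The point is that the headers have to reach the script, so formail -I "" must not be run first.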