python email sieve-language

Why is my script not consistently detecting contents in email bodies?


I've set up a sieve filter which invokes a Python script when it detects a postal service email about package deliveries. The sieve filter works fine and invokes the Python script reliably. However, the Python script does not reliably do its work. Here is my Python script, reduced to the relevant parts:

#!/usr/bin/env python3

import sys
from email import message_from_file
from email import policy
import subprocess

msg = message_from_file(sys.stdin, policy=policy.default)
if " out for delivery " in str(msg.get_body(("html"))):
    print("It is out for delivery")

I get email messages that have the string " out for delivery " in the body of the message but the script does not print out "It is out for delivery". I've already checked the HTML in the messages to make sure it is consistent and it is 100% consistent. The frustrating thing though is that if I save the message from my mail reader that should have triggered the script, and I feed it to sieve-test manually, then the script works 100% of the time!

How come my script never works during actual mail delivery but always works whenever I test it with sieve-test?

Notes:

  1. The email contains only a single part, which is HTML, so I have to use the HTML part.

  2. I know I can do a body test in sieve. I'm doing it in Python for reasons outside the scope of this question.


Solution

  • The problem is that you use str(msg.get_body(("html"))), which is unreliable for your purpose. It gives you the MIME part as a string, but still in the form used for inclusion inside an email message: the content transfer encoding has not been undone. The part may be encoded as quoted-printable, which limits encoded lines to 76 characters, so the string you test for (" out for delivery ") can end up split across two lines. The text you are looking for could then appear like this:

    [other text] out for =
    delivery [more text]
    

    The = sign is part of the encoding and indicates that the newline that follows is there because of the encoding rather than because it was there prior to encoding.
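The effect of such a soft line break can be reproduced with a small sketch. The message below is a made-up sample (addresses, subject and tracking number are hypothetical, not taken from a real postal service email), hand-encoded as quoted-printable so that the break falls inside the phrase being searched for. Note that the sketch passes the one-element tuple ("html",) to get_body, since ("html") without the trailing comma is just the string "html".

```python
from email import message_from_string, policy

# Hypothetical single-part HTML message, already quoted-printable encoded.
# The "=" at the end of the first body line is a soft line break.
RAW = (
    "From: carrier@example.com\n"
    "To: me@example.com\n"
    "Subject: Delivery update\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<html><body><p>Your package with tracking number 1234567890 is out for =\n"
    "delivery today.</p></body></html>\n"
)

NEEDLE = " out for delivery "

msg = message_from_string(RAW, policy=policy.default)
part = msg.get_body(("html",))

# str() serializes the part: headers plus the still-encoded payload.
serialized = str(part)
# get_content() undoes the transfer encoding and decodes the charset.
decoded = part.get_content()

found_in_serialized = NEEDLE in serialized
found_in_decoded = NEEDLE in decoded

print("found in str(part):         ", found_in_serialized)
print("found in part.get_content():", found_in_decoded)
```

In this sample the phrase is only found after decoding, because the serialized form still contains the "=" and the newline between "for" and "delivery".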

    Ok, but why does it always work when you use sieve-test? What happens is that your mail reader encodes the message differently, and the way it encodes it, the text you are looking for is not split across lines, and your script works! It is perfectly correct for the mail reader to save the message with a different encoding so long as once the email is decoded its content has not changed.

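    What you should do is use msg.get_body(("html")).get_content(). This returns the body in decoded form: the transfer encoding is removed and the bytes are converted to text using the part's charset, so you test against the same text the postal service composed, no matter how the message was encoded along the way.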
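A possible shape for the corrected script is sketched below. The helper names is_out_for_delivery and main, and the two sample messages, are assumptions made for illustration; they are not part of the original script. The samples carry the same HTML under two transfer encodings (quoted-printable with a soft line break, and 8bit as a stand-in for whatever a mail reader might write when saving), to show that the decoded test gives the same answer for both.

```python
import sys
from email import message_from_file, message_from_string, policy

NEEDLE = " out for delivery "


def is_out_for_delivery(msg):
    """Return True if the decoded HTML body contains NEEDLE."""
    part = msg.get_body(("html",))  # one-element tuple, not the string "html"
    if part is None:
        # No HTML part found, so there is nothing to search.
        return False
    return NEEDLE in part.get_content()


def main(stream=None):
    """Read one message (stdin by default) and report its delivery status."""
    msg = message_from_file(stream or sys.stdin, policy=policy.default)
    if is_out_for_delivery(msg):
        print("It is out for delivery")


# Hypothetical samples: identical content, different transfer encodings.
HEADERS = 'MIME-Version: 1.0\nContent-Type: text/html; charset="utf-8"\n'
AS_DELIVERED = (
    HEADERS
    + "Content-Transfer-Encoding: quoted-printable\n\n"
    + "<p>Your package is out for =\ndelivery today.</p>\n"
)
AS_SAVED = (
    HEADERS
    + "Content-Transfer-Encoding: 8bit\n\n"
    + "<p>Your package is out for delivery today.</p>\n"
)

results = [
    is_out_for_delivery(message_from_string(raw, policy=policy.default))
    for raw in (AS_DELIVERED, AS_SAVED)
]
print(results)
```

In the script that sieve invokes, the sample messages would be dropped and main() called at the bottom, so that the message is read from standard input as before.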