Tags: python, regex, encoding, mime

Regex to Filter out ">" and "=20" from email.message_from_string


I want to write a regular expression to filter all the junk out of an email that is pulled in through the imaplib and email modules in my Python script below. I'm thinking a regex is best, but feel free to suggest better solutions. Any idea why the email text has an equals sign in the word be=tter below? The original email has it as better.

Python snippet:

import email
emailMessage = email.message_from_string(raw_email)  # raw_email: placeholder name for the raw message text fetched with imaplib
print emailMessage.get_payload()

Print Text:

>=20
> >>>>
> >>>> Hope this makes it through you spam filter but couldn't think of a be=
tter subject.
> >>>>

Solution

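  • As Karl Knechtel says in the comments, your message is encoded as quoted-printable. In that encoding, =20 is an encoded space and an = at the end of a line is a soft line break, which is why better shows up as be= with tter on the next line. To decode it, use quopri.decodestring():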

    import quopri
    
    decoded = quopri.decodestring(emailMessage.get_payload())
    

    Using regexes to strip out the "junk" characters is going to be inefficient and brittle: whenever a new escape sequence turns up in your input down the line, you'll have to modify your code.

    However, if after decoding you want to lose the > characters (and any whitespace between them) at the beginning of each line, then a regex is a reasonable solution for that:

    import re
    
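    # Match any run of ">" and spaces at the start of a line
    chevrons = re.compile(r"(?m)^[> ]*")
    # On Python 3, decoded is bytes; decode it to str before this call
    stripped = chevrons.sub("", decoded)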
    

    (?m) turns on multiline mode, by the way, so that ^ matches at the start of every line rather than only at the start of the string.
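
For anyone on Python 3, here is a minimal sketch of the two steps together (decode, then strip the quote markers). Note that it swaps in a different technique from the answer: instead of calling quopri directly, it uses get_payload(decode=True), which undoes whatever Content-Transfer-Encoding the message declares and returns bytes. RAW_MESSAGE, QUOTE_PREFIX and clean_body are made-up names, and the sample headers are an assumption about what the original email looked like.

```python
import email
import re

# Hypothetical sample message for illustration only; in the question the raw
# text would come from imaplib. The headers here are assumed, not from the post.
RAW_MESSAGE = (
    'Content-Type: text/plain; charset="utf-8"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    ">=20\n"
    "> >>>>\n"
    "> >>>> Hope this makes it through you spam filter but couldn't think of a be=\n"
    "tter subject.\n"
    "> >>>>\n"
)

# (?m) makes ^ match at the start of every line, so each line's leading run of
# ">" characters and spaces is matched.
QUOTE_PREFIX = re.compile(r"(?m)^[> ]*")


def clean_body(raw_message):
    """Decode a single-part message body and strip leading quote markers."""
    message = email.message_from_string(raw_message)
    # decode=True reverses the declared transfer encoding (quoted-printable
    # here) and returns bytes.
    payload = message.get_payload(decode=True)
    # Fall back to UTF-8 if the message does not declare a charset (assumption).
    charset = message.get_content_charset() or "utf-8"
    text = payload.decode(charset, errors="replace")
    return QUOTE_PREFIX.sub("", text)


cleaned = clean_body(RAW_MESSAGE)
print(cleaned)
```

This only handles single-part messages; for a multipart message, get_payload(decode=True) returns None, so you would have to walk the parts and clean each text part separately.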