python email byte-order-mark

Format Python String with BOM


I have a string that (I think) contains a BOM, and I would like to remove every BOM without changing the formatting.

For example my string looks like this:

>=20
> =EF=BB=BF
>=20
> -Jeff
>=20
> Begin forwarded message:
>=20

And I would like it to look like:

>
>
>
> -Jeff
>
> Begin forwarded message:
>

I am fine with the > being left to indicate indentation; I just want the stray characters removed. If I decode the message, I get a string that is uglier and harder to read than what I already have. It has a bunch of \r\n\r\n in it from the line breaks, so ideally I'd like to just remove the things mentioned and leave the format alone.

Edit 1: Here is how I am getting to this point:

def getEmails():
    LOG.debug("Starting to get emails")
    conn = connectToMailServers()
    conn.select('inbox', readonly=True )
    result, data = conn.search(None, '(UNSEEN)')
    mail_ids = data[0]

    id_list = mail_ids.split()

    for i in id_list:
        result, data = conn.fetch(str(int(i)), '(RFC822)' )
        for response_part in data:
            if isinstance(response_part, tuple):
                msg = email.message_from_bytes(response_part[1])
                getPlainText(msg)

def getPlainText(msg):
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            LOG.debug(part.get_payload())
            return str(part.get_payload())

If I turn on decoding (part.get_payload(decode=True)), the string ends up full of \r\n\r\n sequences. How can I do this without decoding, or how can I turn the decoded result into a properly formatted string without the line-break escapes?
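
For context, =20 and =EF=BB=BF are quoted-printable escapes for a space and the UTF-8 byte order mark, which is why the undecoded payload shows them literally. The following is a minimal sketch of decoding such text and cleaning it up, assuming the sample above is the raw payload as bytes with CRLF line endings; the helper name clean_quoted_printable is hypothetical and not part of the original code.

```python
import quopri


def clean_quoted_printable(raw: bytes) -> str:
    """Decode quoted-printable bytes, drop BOMs, and normalize line endings."""
    decoded = quopri.decodestring(raw)      # =20 -> space, =EF=BB=BF -> BOM bytes
    text = decoded.decode("utf-8")          # the BOM bytes become U+FEFF
    text = text.replace("\ufeff", "")       # remove every BOM, not only a leading one
    text = text.replace("\r\n", "\n")       # collapse CRLF to LF
    # Strip the trailing spaces that the =20 escapes leave on each quoted line.
    return "\n".join(line.rstrip() for line in text.split("\n"))


# Assumed stand-in for the undecoded payload shown in the question.
sample = (
    b">=20\r\n"
    b"> =EF=BB=BF\r\n"
    b">=20\r\n"
    b"> -Jeff\r\n"
    b">=20\r\n"
    b"> Begin forwarded message:\r\n"
    b">=20\r\n"
)

print(clean_quoted_printable(sample))
```

The quote markers and line structure are kept; only the escapes, the BOM, and the trailing whitespace are removed.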


Solution

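  • Explicitly telling the str constructor to decode the payload as UTF-8 worked:

    str(getPlainText(msg), "utf-8")

    This gave me the results I was looking for. Note that it only works when getPlainText returns bytes, that is, part.get_payload(decode=True) without the str() wrapper; passing an encoding to str() for a value that is already a str raises a TypeError.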
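
A minimal end-to-end sketch of this approach, under some assumptions: RAW is a made-up message standing in for the bytes returned by conn.fetch(), and get_plain_text is a hypothetical variant of getPlainText that returns the decoded bytes. The final clean-up line is an optional extra and is not part of the accepted solution.

```python
import email

# Hypothetical raw message standing in for response_part[1] from conn.fetch().
RAW = (
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b">=20\r\n"
    b"> =EF=BB=BF\r\n"
    b"> -Jeff\r\n"
)


def get_plain_text(msg):
    """Return the first text/plain part as decoded bytes, or None if absent."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            # decode=True undoes the quoted-printable transfer encoding
            # and returns bytes rather than str.
            return part.get_payload(decode=True)
    return None


msg = email.message_from_bytes(RAW)

# The step from the solution: turn the decoded bytes into text as UTF-8.
body = str(get_plain_text(msg), "utf-8")

# Optional clean-up (an addition): drop BOM characters and normalize CRLF.
body = body.replace("\ufeff", "").replace("\r\n", "\n")

print(body)
```

Printing the string, rather than logging its repr, shows the line breaks as real line breaks instead of \r\n escapes.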