python, string, email, mime

Remove characters from HTML string only if they appear after specific pattern


I searched for this problem but could not find a satisfactory answer. I have written a program that parses HTML emails. It worked fine until recently, so I guess something has changed on the Outlook side. Now, when I extract the HTML content of an email, everything works except for the style attributes.

The quotes around every style value are escaped for some reason, for example: <span style=\'color:red; background:yellow; mso-highlight:yellow\'> and <span style=\'background:yellow;mso-highlight:yellow\'>. Notice how the ' marks around the style value are preceded by a backslash? This makes my software crash. I do not need these escape markers and want to get rid of them.

So my question is: how do I remove the markers ONLY in these specific places (if possible)? That is, right after style= and at the end of the style value, just before '>. I do not want to remove every backslash, because anything that really needs to be escaped should stay escaped. Any help is much appreciated; I am stuck and have no idea how to proceed.

Thanks in advance!


Solution

  • A simple regular expression should work:

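    import re

    # \\' in the pattern matches a literal backslash followed by a quote; a bare
    # \' would match only the quote. The non-greedy group keeps each attribute separate.
    text = re.sub(r"style=\\'(.*?)\\'", r"style='\1'", raw_text)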
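
To see the substitution on the kind of input shown in the question, here is a minimal sketch. The helper name `unescape_style_quotes` and the sample string are made up for illustration, and the pattern assumes the escapes always look exactly like style=\'...\'. It matches the backslash explicitly and uses a non-greedy group so that two style attributes on the same line are handled one at a time.

```python
import re

# Assumption: the unwanted escapes always have the form style=\'...\'.
# \\' matches a literal backslash followed by a single quote.
STYLE_ESCAPED = re.compile(r"style=\\'(.*?)\\'")


def unescape_style_quotes(html: str) -> str:
    """Remove the backslashes around style attribute values only."""
    return STYLE_ESCAPED.sub(r"style='\1'", html)


# Hypothetical sample: two escaped style attributes, plus an escaped quote
# elsewhere in the text that should be left alone.
sample = (
    r"<span style=\'color:red; background:yellow\'>a</span>"
    r"<span style=\'background:yellow\'>b \'quoted\'</span>"
)

cleaned = unescape_style_quotes(sample)
print(cleaned)
```

Backslashes outside a style attribute are left untouched. One limitation of this sketch: if a style value itself contains \' (for example around a font name), the non-greedy match stops at that first inner \', so such values would need a stricter pattern or a real HTML parser.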