python regex unicode special-characters python-unicode

How to remove special characters from strings in Python?


I have millions of strings scraped from the web, like:

s = 'WHAT\xe2\x80\x99S UP DOC?'
type(s) == str # returns True

Special characters like the ones in the string above are inevitable when scraping the web. How can I remove all such special characters and keep just the clean text? Based on my very limited experience with Unicode characters, I am thinking of a regular expression like this:

\\x.*[0-9]

Solution

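  • The special characters are not actually multiple characters long. Each \xNN escape is just how a single character is represented, so your regex, which looks for a literal backslash, isn't going to match anything. If you print the string you will see the actual Unicode characters instead of their escape sequences: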

    >>> s = 'WHAT\xe2\x80\x99S UP DOC?'
    >>> print(s)
    WHATâS UP DOC?
    >>> repr(s)
    "'WHATâ\\x80\\x99S UP DOC?'"
    

    If you want to keep only the printable ASCII characters, you can check whether each character is in string.printable:

    >>> import string
    >>> ''.join(i for i in s if i in string.printable)
    'WHATS UP DOC?'
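
For millions of strings, the same filter can be wrapped in a small helper. The following is only a sketch under assumptions, not part of the original answer: the names `PRINTABLE` and `strip_non_printable` are hypothetical, and it assumes Python 3, where `s` is a `str`. It also checks that the regex from the question finds nothing, because the string contains no literal backslash.

```python
import re
import string

# Hypothetical name: a frozenset gives constant-time membership tests,
# whereas string.printable itself is a plain str that is scanned linearly.
PRINTABLE = frozenset(string.printable)


def strip_non_printable(text):
    """Return text with every character outside string.printable removed."""
    return ''.join(ch for ch in text if ch in PRINTABLE)


s = 'WHAT\xe2\x80\x99S UP DOC?'

# Each \xNN escape is a single character, so there is nothing for the
# regex from the question (literal backslash, then "x") to match.
match = re.search(r'\\x.*[0-9]', s)

cleaned = strip_non_printable(s)
print(len(s), match, cleaned)
```

Note that `string.printable` also contains whitespace such as tabs and newlines, so those are kept; anything outside ASCII, including legitimate accented letters, is dropped.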