Tags: python, python-2.7, utf-8, character-encoding, string-parsing

Remove all hex characters from string in Python


Although there are similar questions, I can't seem to find a working solution for my case:

I'm encountering some annoying hex chars in strings, e.g.

'\xe2\x80\x9chttp://www.google.com\xe2\x80\x9d blah blah#%#@$^blah'

What I need is to remove these hex \xHH characters, and them alone, in order to get the following result:

'http://www.google.com blah blah#%#@$^blah'

Decoding doesn't help, because it only turns the bytes into the non-ASCII characters they encode:

s.decode('utf8') # u'\u201chttp://www.google.com\u201d blah blah#%#@$^blah'

How can I achieve that?


Solution

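  • The \xHH sequences are the UTF-8 bytes of non-ASCII characters (here the curly quotes U+201C and U+201D), so just remove all non-ASCII characters: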

    >>> s.decode('utf8').encode('ascii', errors='ignore')
    'http://www.google.com blah blah#%#@$^blah'
    

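    Another possible solution is to keep only the characters in string.printable (in Python 2, filter() on a str returns a str):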

    >>> import string
    >>> s = '\xe2\x80\x9chttp://www.google.com\xe2\x80\x9d blah blah#%#@$^blah'
    >>> printable = set(string.printable)
    >>> filter(lambda x: x in printable, s)
    'http://www.google.com blah blah#%#@$^blah'
    

    Or use a regular expression that strips everything outside the ASCII range:

    >>> import re
    >>> re.sub(r'[^\x00-\x7f]', r'', s)
    'http://www.google.com blah blah#%#@$^blah'
    

    Pick your favorite one.
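
The sessions above are for Python 2.7, where s is a byte string. As a rough sketch of how the same three approaches might look on Python 3 (an assumption, since the question targets 2.7), the input is taken to be UTF-8 bytes that get decoded to text first. The variable names raw, text, ascii_only, printable_only and regex_only are made up for this illustration.

```python
import re
import string

# Assumption: the data arrives as UTF-8 encoded bytes, as in the question.
raw = b'\xe2\x80\x9chttp://www.google.com\xe2\x80\x9d blah blah#%#@$^blah'
text = raw.decode('utf-8')  # now a str that starts and ends the URL with curly quotes

# 1. Encode to ASCII, silently dropping anything that does not fit,
#    then decode back so the result is a str rather than bytes.
ascii_only = text.encode('ascii', errors='ignore').decode('ascii')

# 2. Keep only characters found in string.printable.
#    filter() returns an iterator in Python 3, so a join is used instead.
printable = set(string.printable)
printable_only = ''.join(ch for ch in text if ch in printable)

# 3. Remove every character outside the ASCII range with a regex.
regex_only = re.sub(r'[^\x00-\x7f]', '', text)

print(ascii_only)
```

For this input all three variants give the same string. They can differ on other input: string.printable also leaves out most ASCII control characters, while the other two keep them.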