python regex python-unicode unicode-escapes

Remove \u from string?


I have a few words in a list that look like '\uword'. I want to replace the '\u' with an empty string. I looked around on SO, but nothing has worked for me so far. I tried converting to a raw string with "%r" % word, and I also tried word.encode('unicode-escape'), but neither got me anywhere. Any ideas?

EDIT

Adding code

word = '\u2019'
word.encode('unicode-escape')  # returns a new bytes object; the result is discarded here
print(word) # error

word = '\u2019'
word = "%r"%word
print(word) # error

Solution

  • I was wrong to assume that the .encode method of a string modifies the string in place, the way the .sort() method of a list does. According to the documentation:

    The opposite method of bytes.decode() is str.encode(), which returns a bytes representation of the Unicode string, encoded in the requested encoding.

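    def remove_u(word):
        # Turn non-ASCII characters into escape text, e.g. the single
        # character '\u2019' becomes the six characters \u2019.
        word_u = word.encode('unicode-escape').decode('ascii')
        if '\\u' in word_u:
            # Drop every '\u' marker but keep the text around it.
            return word_u.replace('\\u', '')
        return word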
    
    vocabulary_ = [remove_u(each_word) for each_word in vocabulary_]
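
The point about .encode not working in place can be checked with a short sketch. The variable names `escaped`, `text` and `stripped` are only illustrative, and the sample word is the one from the question:

```python
word = '\u2019'  # one character: RIGHT SINGLE QUOTATION MARK

# str.encode does not touch `word` (strings are immutable);
# it returns a new bytes object that has to be assigned to a name.
escaped = word.encode('unicode-escape')

print(type(escaped).__name__)  # bytes
print(len(word))               # 1, the original string is unchanged
print(escaped)                 # b'\\u2019'

# The escaped form is plain ASCII, so it can be decoded back to str
# and the '\u' marker removed like any other substring.
text = escaped.decode('ascii')
stripped = text.replace('\\u', '')
print(stripped)                # 2019
```

Assigning the result back, as in word = word.encode('unicode-escape'), is what the attempts in the question were missing.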