Search code examples
pythonregexemoticons

How to match a emoticon in sentence with regular expressions


I'm using Python to process Weibo (a twitter-like service in China) sentences. There are some emoticons in the sentences, whose corresponding unicode are \ue317 etc. To process the sentence, I need to encode the sentence with gbk, see below:

 string1_gbk = string1.decode('utf-8').encode('gb2312')

There will be a UnicodeEncodeError:'gbk' codec can't encode character u'\ue317'

I tried \\ue[0-9a-zA-Z]{3}, but it did not work. How could I match these emoticons in sentences?


Solution

  • Try

    string1_gbk = string1.decode('utf-8').encode('gb2312', 'replace')
    

    Should output ? instead of those emoticons.

    Python Docs - Python Wiki