
Using Python regex to identify retweeters from tweets with Chinese characters


Given a tweet from Sina Weibo:

  tweet = "//@lilei: dd //@Bob: cc//@Girl: dd//@魏武: 利益所致 自然念念不忘// @诺什: 吸引优质  客户,摆脱屌丝男!!!//@MarkGreene: 转发微博"

Note that there is a space between // and @诺什.

I want to get a list of retweeters, like this:

  result = ['lilei', 'Bob', 'Girl', '魏武', 'MarkGreene']

I have been thinking about using the following script:

  import re

  RTpattern = r'''//?@(\w+)'''
  rt = re.findall(RTpattern, tweet)

However, this fails to capture the Chinese name '魏武'.


Solution

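  • Use the re.UNICODE flag, and make the tweet a unicode string (note the u prefix) so that \w can match the Chinese characters: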

    re.UNICODE
    Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character 
    properties database.
    

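    # -*- coding: utf-8 -*-
    import re

    tweet = u"//@lilei: dd //@Bob: cc//@Girl: dd//@魏武: 利益所致 自然念念不忘// @诺什: 吸引优质  客户,摆脱屌丝男!!!//@MarkGreene: 转发微博"
    RTpattern = r'''//?@(\w+)'''
    # re.UNICODE makes \w match non-ASCII word characters such as 魏武
    for word in re.findall(RTpattern, tweet, re.UNICODE):
        print(word)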
    
    # lilei
    # Bob
    # Girl
    # 魏武
    # MarkGreene
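
The answer above is written for Python 2. As a minimal sketch for Python 3, where str patterns already use Unicode matching by default, the flag can be dropped. The names RT_PATTERN, retweeters and ascii_only are illustrative and not part of the original question. Passing re.ASCII is included only to show what happens when \w is restricted to ASCII, which loses the Chinese name just as in the question.

```python
import re

tweet = "//@lilei: dd //@Bob: cc//@Girl: dd//@魏武: 利益所致 自然念念不忘// @诺什: 吸引优质  客户,摆脱屌丝男!!!//@MarkGreene: 转发微博"
RT_PATTERN = r"//?@(\w+)"

# Python 3: str patterns match Unicode word characters by default,
# so re.UNICODE is redundant here.
retweeters = re.findall(RT_PATTERN, tweet)

# With re.ASCII, \w only matches [a-zA-Z0-9_], so '魏武' is dropped.
ascii_only = re.findall(RT_PATTERN, tweet, re.ASCII)

print(retweeters)
print(ascii_only)
```

'诺什' is skipped in both cases because the pattern does not allow the space between // and @, which matches the result asked for in the question.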