Given a tweet from Sina Weibo:
tweet = "//@lilei: dd //@Bob: cc//@Girl: dd//@魏武: 利益所致 自然念念不忘// @诺什: 吸引优质 客户,摆脱屌丝男!!!//@MarkGreene: 转发微博"
Note that there is a space between // and @诺什.
I want to get a list of retweeters, like this:
result = ['lilei', 'Bob', 'Girl', '魏武', 'MarkGreene']
I tried the following script:
import re

RTpattern = r'''//?@(\w+)'''
rt = re.findall(RTpattern, tweet)
However, it fails to capture the Chinese name '魏武'.
Use the re.UNICODE flag. From the Python documentation:
re.UNICODE
Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character
properties database.
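# -*- coding: utf-8 -*-
# Python 2: the source encoding line is needed for the non-ASCII literal
import re

tweet = u"//@lilei: dd //@Bob: cc//@Girl: dd//@魏武: 利益所致 自然念念不忘// @诺什: 吸引优质 客户,摆脱屌丝男!!!//@MarkGreene: 转发微博"
RTpattern = r'''//?@(\w+)'''
# re.UNICODE makes \w match non-ASCII word characters such as Chinese
for word in re.findall(RTpattern, tweet, re.UNICODE):
    print word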
# lilei
# Bob
# Girl
# 魏武
# MarkGreene
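For readers on Python 3, here is a minimal sketch of the same idea. It assumes Python 3, where str patterns are Unicode-aware by default, so the re.UNICODE flag is no longer required. The helper name retweeters is hypothetical and used only for illustration. The re.ASCII call at the end reproduces the behaviour described in the question, where '魏武' is skipped.

```python
import re

tweet = ("//@lilei: dd //@Bob: cc//@Girl: dd//@魏武: 利益所致 自然念念不忘"
         "// @诺什: 吸引优质 客户,摆脱屌丝男!!!//@MarkGreene: 转发微博")

# Same pattern as in the question: one or two slashes, '@', then word characters.
RT_PATTERN = re.compile(r"//?@(\w+)")


def retweeters(text):
    """Return the names that follow '//@' (hypothetical helper for illustration)."""
    return RT_PATTERN.findall(text)


names = retweeters(tweet)
print(names)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the Chinese name is not matched.
ascii_only = re.findall(r"//?@(\w+)", tweet, re.ASCII)
print(ascii_only)
```

Because of the space in "// @诺什", that name is not matched by the pattern in either case, which agrees with the desired result in the question.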