Tags: php, string, tokenize, tweets, preg-split

PHP: tokenize a tweet into words, punctuation, hashtags, mentions, and emoticons


I would like to tokenize a tweet. As you probably know, tweets are usually written informally, as follows:

This is a common Tweet #format where @mentions and.errors!!!!like this:-))))) might #appear❤ ❤☺❤#ThisIsAHashtag!?!

Tweets may also contain Unicode emoji (hearts, smileys, etc.). I'm working on a preg_split call to tokenize them. The desired output is:

This
is
a
common
Tweet
#format
where
@mentions
and
.
errors
!!!!
like
this
:-)))))
might
#appear
❤
❤
☺
❤
#ThisIsAHashtag
!?!

The preg_split call I have so far is:

preg_split('/(?<=\s)|(?<=\w)(?=[.,:;!?(){}-])|(?<=[.,!()?\x{201C}])(?=[^ ])/u', $tweet);

Any help is appreciated.


Solution

  • You can use this pattern with preg_match_all:

    ~[#@]?\w+|\pP+|\S~u
    

    Note: You can easily extend this pattern if you need to group another kind of character. For example, to keep runs of currency symbols together:

    ~[#@]?\w+|\pP+|\p{Sc}+|\S~u
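
    For reference, here is a minimal sketch of how the two patterns above could be used with preg_match_all. The helper name tokenize_tweet and the short currency test string are assumptions made for illustration, not part of the original answer. The emoji are written as "\u{...}" escapes, which assumes PHP 7.0 or later.

```php
<?php
// Hypothetical helper (the name is illustrative): returns every token
// matched by the given pattern, in order of appearance.
function tokenize_tweet(string $tweet, string $pattern = '~[#@]?\w+|\pP+|\S~u'): array
{
    // preg_match_all returns false on failure, e.g. invalid UTF-8 with the u flag.
    if (preg_match_all($pattern, $tweet, $matches) === false) {
        return [];
    }
    return $matches[0]; // index 0 holds the full-pattern matches
}

// The sample tweet from the question; U+2764 is the heart, U+263A the smiley.
$tweet = 'This is a common Tweet #format where @mentions and.errors!!!!like this:-))))) '
       . "might #appear\u{2764} \u{2764}\u{263A}\u{2764}#ThisIsAHashtag!?!";

$tokens = tokenize_tweet($tweet);
print_r($tokens);

// Currency variant from the note: a run of currency symbols becomes one token.
// U+20AC is the euro sign.
$currencyTokens = tokenize_tweet(
    'only $$$ 5' . "\u{20AC}!",
    '~[#@]?\w+|\pP+|\p{Sc}+|\S~u'
);
print_r($currencyTokens);
```

    With the sample tweet this should give the token list requested in the question. Note that emoticons containing letters, such as :-D, would be split into :- and D, because the letter falls under \w rather than \pP.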