r regex twitter

R tweets with emojis


I scraped tweets from the Twitter API with the rtweet package, but I don't know how to work with text that contains emojis. They appear in the form '\U0001f600', and every regex I have tried so far has failed. I can't get anything out of it.

For example:

 text = 'text text. \U0001f600'
 grepl('U',text)

gives me FALSE, and

 grepl('000',text)

also gives me FALSE.

Another problem is that the emojis are often stuck to the preceding word (for example, i am here\U0001f600).

So how can I make R recognize emojis in that format? What pattern can I pass to grepl so that it returns TRUE for any emoji in that format?


Solution

  • In R there tends to be a package for most things. In this case it is textclean, which brings along the lexicon package and its many dictionaries. textclean gives you two functions for this: replace_emoji and replace_emoji_identifier.

    text = c("text text. \U0001f600", "i am here\U0001f600")
    
    # replace emoji with identifier:
    textclean::replace_emoji_identifier(text)
    [1] "text text. lexiconvygwtlyrpywfarytvfis " "i am here lexiconvygwtlyrpywfarytvfis " 
    
    # replace emoji with text representation
    textclean::replace_emoji(text)
    [1] "text text. grinning face " "i am here grinning face " 
    

    Next you could use sentimentr for sentiment scoring on the emojis, or quanteda for text analysis. If you just want to check for the presence of an emoji, as in your expected output:

    grepl("lexicon[[:alpha:]]{20}", textclean::replace_emoji_identifier(text))
    [1] TRUE TRUE
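
If you would rather not add a package, a base R check on code points is one possible alternative. The sketch below is only an illustration: has_emoji is a hypothetical helper, not part of textclean or any other package, and its default range (U+1F300 to U+1FAFF) is an assumption that roughly covers the main emoji blocks, misses some emojis (for example those below U+1F300), and includes a few non-emoji symbols. It also shows why grepl('U', text) returned FALSE in the question: R's parser converts the escape into a single character, so the string never contains a literal "U" or "000".

```r
# R's parser turns the escape \U0001f600 into one character (code point
# U+1F600), so the string holds no literal "U" or "000" for grepl to find.
nchar("\U0001f600")   # 1

# Hypothetical helper (not from any package): TRUE where a string contains
# at least one code point in [lo, hi]. The default range is an assumption,
# a rough cover of the main emoji blocks rather than a complete definition.
has_emoji <- function(x, lo = 0x1F300L, hi = 0x1FAFFL) {
  vapply(x, function(s) {
    cp <- utf8ToInt(enc2utf8(s))   # integer code points of one string
    any(cp >= lo & cp <= hi)
  }, logical(1), USE.NAMES = FALSE)
}

text <- c("text text. \U0001f600", "i am here\U0001f600", "no emoji here")
has_emoji(text)   # TRUE TRUE FALSE
```

Because the check works on code points, it does not matter whether the emoji is attached to the preceding word. For anything beyond a quick presence check, the dictionary-based textclean functions above are likely the more complete option.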