
Regex to remove everything but emojis from a string in R?


I have a big .xlsx file containing tweets with emojis. I am working on a personal project where I want to make a network graph from the extracted emojis. For example, if I have this in one of the columns:

Christian✝️, Husband👫, Father👨‍👩‍👦‍👦, Former TV 📺Meteorologist🌪, GOP🐘, LTC 🔫, Dolfan🐬, since ‘75, Yanks Fan⚾️ & UCONN Alum🏀 Go Whalers🐋!

How would I get only this as output?

✝️👫👨‍👩‍👦‍👦📺🌪🐘🔫🐬⚾️🏀🐋

I have searched thoroughly on Stack Overflow and elsewhere online, but I couldn't find anything. I am a beginner in R.

Edit

When I read the file normally, I get the Unicode representation (in UTF-8 format), but I don't know how to turn it into emojis. There are dictionaries online, but they only give the names of some of these emojis, and they are very outdated.

Edit 2

There is a solution that works on Linux, but I am looking for a solution or hint to get this working on Windows.


Solution

  • This works for me, with one caveat: in my console only the cross prints as an emoji; the rest appear as their Unicode escape sequences.

    # install.packages("remotes")
    # remotes::install_github("hadley/emo")
    emojis <- "Christian✝️, Husband👫, Father👨‍👩‍👦‍👦, Former TV 📺Meteorologist🌪, GOP🐘, LTC 🔫, Dolfan🐬, since ‘75, Yanks Fan⚾️ & UCONN Alum🏀 Go Whalers🐋!"
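    emojis  # print the input string
    # Extract the emojis; the result is a list with one character vector per input string
    only_emojis <- emo::ji_extract_all(emojis)
    only_emojis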
    
    #  emo::ji_extract_all(emojis)
    # [[1]]
    #  [1] "✝️"      "\U0001f46b"      "\U0001f468"      "\U0001f469"      "\U0001f466"      "\U0001f466"      "\U0001f4fa"      "\U0001f418"      "\U0001f52b"      "\U0001f42c"      "\u26be" "\U0001f3c0"      "\U0001f40b"   
    
    # install.packages("utf8")
    utf8::utf8_print(only_emojis[[1]])  
    # [1] "✝️​" "👫​" "👨​" "👩​" "👦​" "👦​" "📺​" "🐘​" "🔫​" "🐬​" "⚾​" "🏀​" "🐋​"
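
If installing emo is not an option, a rough base R alternative is to filter the string by code point instead of by regex. The sketch below is a heuristic, not a full emoji definition: the function name `keep_emoji_codepoints` is hypothetical, and the ranges it keeps (U+1F000 to U+1FAFF, U+2600 to U+27BF, plus the zero-width joiner and variation selector-16) are an assumption that covers the emojis in the question but will miss some emojis and keep a few non-emoji symbols.

```r
# Heuristic sketch: keep only code points that fall in a few emoji-heavy
# Unicode ranges. The ranges are an assumption, not the official emoji list.
keep_emoji_codepoints <- function(x) {
  # utf8ToInt() expects a single UTF-8 string and returns its code points
  cps <- utf8ToInt(enc2utf8(x))
  is_emoji <- (cps >= 0x1F000 & cps <= 0x1FAFF) |  # pictographs and related symbols
              (cps >= 0x2600  & cps <= 0x27BF)  |  # miscellaneous symbols, dingbats
              cps == 0x200D |                      # zero-width joiner (family sequences)
              cps == 0xFE0F                        # variation selector-16
  # Re-assemble the kept code points into one string
  intToUtf8(cps[is_emoji])
}

# Escapes are used here so the example does not depend on the file encoding
keep_emoji_codepoints("GOP\U0001f418, LTC \U0001f52b, Dolfan\U0001f42c")
```

Because the joiner and the variation selector are kept, multi-part emojis such as the family stay in one piece in the result. For a whole column, something like `vapply(tweets, keep_emoji_codepoints, character(1))` should work, where `tweets` stands for a character vector without missing values.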