Tags: r, string, nlp, emoji

Empty string with length > 0 in R


I ran into a strange case in my data frame while working with emojis in R. I want to remove all emojis before running a sentiment analysis. After removing them, some strings that should be empty are not. What is the problem? I would like to replace empty fields with NA. Here is a small example:

library(tidyverse)

df <- data.frame(x = c("test","♥️♥️🙌♥"))

nchar(df$x[2])

df_new <- df |>
  mutate(x = str_remove_all(x, "[[:emoji:]]"))

is_empty(df_new$x[2])

Now I would like to use the following command, but it doesn't work because the string is not empty.

tmp <- df_new |>
  mutate(x = na_if(x, ""))

What is the problem here, and how can I solve it?

Thank you in advance,

Aaron


Solution

  • If you want to remove all non-letter characters while supporting any language, and you have already split your x values into single words, you can simply do:

    df <- data.frame(x = c("test","♥️♥️🙌♥"))
    
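    library(dplyr)    # provides %>% and mutate()
    library(stringi)
    
    # keep only runs of letters (\p{L}); elements without any letter become NA
    df %>%
      mutate(x = stri_extract_all(x, charclass = "\\p{L}"))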
    
         x
    1 test
    2   NA
    

    If you have strings with multiple words, you can slightly adapt the code above and use this instead:

    df <- data.frame(x = c("Ελλάδα means Greece ♥️ ️", "test","♥️♥️🙌♥"))
    
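    df %>%
      rowwise() %>%   # one row at a time; mutate() cannot overwrite a group_by() column
      # paste() turns a no-match NA into the string "NA", not a missing value
      mutate(x = paste(stri_extract_all(x, charclass = "\\p{L}")[[1]], collapse = " "))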
    
    1 Ελλάδα means Greece
    2 test           
    3 NA
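
As for why the string in the question is not empty: a likely cause is that some of the hearts are followed by an invisible variation selector (U+FE0F), which `[[:emoji:]]` does not seem to remove. The sketch below is one way to check this and clean it up, not a definitive fix. It assumes the input is written out with explicit escapes (my reconstruction of the example string), and the names `no_emoji` and `cleaned` are made up for illustration.

```r
library(stringi)

# Assumed input: the example from the question, with the invisible
# U+FE0F (variation selector-16) written as an explicit escape.
x <- c("test", "\u2665\ufe0f\u2665\ufe0f\U0001F64C\u2665")

# Remove the visible emoji, as the question does.
no_emoji <- stri_replace_all_regex(x, "[[:emoji:]]", "")

# Inspect the code points that are left over in the second element.
print(utf8ToInt(no_emoji[2]))

# Also drop variation selectors (U+FE00..U+FE0F) and the zero-width
# joiner (U+200D), then trim whitespace.
cleaned <- stri_replace_all_regex(no_emoji, "[\\uFE00-\\uFE0F\\u200D]", "")
cleaned <- stri_trim_both(cleaned)

# Turn strings that are now truly empty into NA.
cleaned[cleaned == ""] <- NA_character_
print(cleaned)
```

Once the leftover invisible characters are gone, `na_if(x, "")` from the question should behave as expected as well. Note that `[[:emoji:]]` can also match plain digits, `#` and `*`, so it is worth checking the result on text that contains numbers.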