I ran into an odd case in my data frame while working with emojis in R. I want to delete all emojis before running a sentiment analysis. After doing so, some strings that should be empty are not. What is the problem? I would like to replace the empty fields with NA. Here is a small example:
library(tidyverse)
df <- data.frame(x = c("test","♥️♥️🙌♥"))
nchar(df$x[2])
df_new <- df |>
  mutate(x = str_remove_all(x, "[[:emoji:]]"))
is_empty(df_new$x[2])
Next I would like to run the following command, but it has no effect because the string is not empty:
tmp <- df_new |>
  mutate(x = na_if(x, ""))
What is the problem here, and how can I solve it?
Thank you in advance,
Aaron
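A likely cause, sketched here under some assumptions: the red heart in the example is usually stored as two code points, U+2665 followed by the invisible variation selector U+FE0F, and the emoji class may match only the first of them, so the selectors are left behind. The snippet rebuilds the example string from escapes (an assumption about what df$x[2] actually contains) and lists the code points before and after the removal. code_points() is a helper name made up for this illustration. Note also that is_empty() tests for length zero, so it returns FALSE for any single string, including "".

```r
library(stringr)

# Assumption: same text as df$x[2], written with escapes so that the
# invisible variation selectors (U+FE0F) show up in the source.
x <- "\u2665\ufe0f\u2665\ufe0f\U0001F64C\u2665"

# Hypothetical helper: list the code points of a single string.
code_points <- function(s) sprintf("U+%04X", utf8ToInt(s))

code_points(x)                                 # every code point in the original
leftover <- str_remove_all(x, "[[:emoji:]]")   # same removal as in the question
code_points(leftover)                          # whatever the emoji class did not match
nchar(leftover)

# is_empty() checks length, not content, so it cannot detect "".
purrr::is_empty("")
leftover == ""                                 # compare against "" instead
```

If the assumption holds, the leftover consists only of U+FE0F code points, which print as nothing but still count as characters.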
If you want to remove everything that is not a letter while still supporting any language, and your x values are already split into single words, you can simply do:
library(dplyr)
library(stringi)
df <- data.frame(x = c("test","♥️♥️🙌♥"))
# keep only runs of letters (\p{L}); a row without any letter becomes NA
df %>%
  mutate(x = stri_extract_all(x, charclass = "\\p{L}"))
     x
1 test
2   NA
If your strings contain multiple words, you can adapt the code above slightly and use this instead:
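df <- data.frame(x = c("Ελλάδα means Greece ♥️ ️", "test","♥️♥️🙌♥"))
# vectorised over rows, so no group_by() is needed
# (mutate() cannot overwrite a grouping column)
df %>%
  mutate(x = vapply(stri_extract_all(x, charclass = "\\p{L}"),
                    # join the words of a row; rows without letters stay a real NA
                    function(w) if (anyNA(w)) NA_character_ else paste(w, collapse = " "),
                    character(1)))
                    x
1 Ελλάδα means Greece
2                test
3                <NA>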
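If you would rather stay with the stringr approach from the question, a minimal sketch is to remove the variation selector (U+FE0F) and the zero-width joiner (U+200D) together with the emoji, squish the remaining whitespace, and only then call na_if(). This assumes those two code points are the only leftovers in your data, which is worth checking first. The example string is again written with escapes and is an assumption about the original input. Be aware that the Unicode Emoji property also covers the digits 0-9, # and *, so this class removes those as well.

```r
library(dplyr)
library(stringr)

# Assumption: same two rows as in the question, emojis written as escapes.
df <- data.frame(x = c("test", "\u2665\ufe0f\u2665\ufe0f\U0001F64C\u2665"))

tmp <- df |>
  mutate(
    # drop emoji plus the selector/joiner code points that travel with them
    x = str_remove_all(x, "[[:emoji:]\\ufe0f\\u200d]"),
    # collapse and trim whitespace that is left where emojis used to be
    x = str_squish(x),
    # now truly empty strings can be turned into missing values
    x = na_if(x, "")
  )

tmp
```

Checking with is.na(tmp$x) or tmp$x == "" is more reliable here than is_empty(), which only looks at the length of the vector.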