I am using the following code to convert a data frame to a tidy data frame:
replace_reg <- "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&|<|>|RT|https"
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
tidy_tweets <- tweets %>%
  filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = unnest_reg) %>%
  filter(!word %in% custom_stop_words2$word,
         str_detect(word, "[a-zäöüß]"))
However, this produces a tidy data frame in which the German characters üäöß are stripped from the newly created word column: for example, "wählen" is split into two words, "w" and "hlen", with the special character removed.
I am trying to get a tidy data frame of German words to do text analysis and term frequencies.
Could someone point me in the right direction for how to approach this problem?
You need to replace all A-Za-z\\d in your bracket expressions with [:alnum:]. The POSIX character class [:alnum:] matches Unicode letters and digits.
replace_reg <- "https://t.co/[[:alnum:]]+|http://[[:alnum:]]+|&|<|>|RT|https"
unnest_reg <- "([^[:alnum:]_#@']|'(?![[:alnum:]_#@]))"
If you are using these patterns with stringr functions, you may also consider using [\\p{L}\\p{N}] instead, as in

unnest_reg <- "([^\\p{L}\\p{N}_#@']|'(?![\\p{L}\\p{N}_#@]))"

where \p{L} matches any Unicode letter and \p{N} matches any Unicode digit.
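As a quick sanity check, here is a minimal sketch comparing the original ASCII-only pattern with the Unicode-aware version on a sample German string (the string is hypothetical; stringr's ICU regex engine is assumed, as it is what tidytext's regex tokenizer uses under the hood):

```r
library(stringr)  # stringr uses the Unicode-aware ICU regex engine

# Original ASCII-only pattern vs. the Unicode-aware replacement
old_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
new_reg <- "([^\\p{L}\\p{N}_#@']|'(?![\\p{L}\\p{N}_#@]))"

text <- "wir wählen"  # sample German text

str_split(text, old_reg)[[1]]  # "ä" is treated as a delimiter: "wir" "w" "hlen"
str_split(text, new_reg)[[1]]  # "wählen" survives intact:    "wir" "wählen"
```

With new_reg, only genuine non-word characters (here, the space) act as token boundaries, so umlauts and ß are kept inside words.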