I have words in a dfm like this:
library("quanteda")
dfmat <- dfm(c("hello_text","text_hello","test1_test2", "test2_test1", "test2_test2_test2", "test2_other", "other"))
Here, for example, the tokens "hello_text" and "text_hello" contain the same words in a different order. How is it possible to keep only one of these variants?
Expected output:
dfmat <- dfm(c("hello_text","test1_test2", "test2_test2_test2", "test2_other", "other"))
I found an example solution, but it only removes words that are exactly the same.
Split the strings at the underscore and sort the parts alphabetically, then use the sorted parts to identify duplicates and apply the result to the original vector:
words <- c("hello_text","text_hello","test1_test2", "test2_test1", "test2_test2_test2", "test2_other", "other")
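# lapply() always returns a list, so duplicated() compares one sorted vector per word
# (sapply() would simplify to a matrix when every word has the same number of parts)
words_sorted <- lapply(strsplit(words, "_", fixed = TRUE), sort)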
words[!duplicated(words_sorted)]
Returns:
[1] "hello_text" "test1_test2" "test2_test2_test2" "test2_other"
[5] "other"
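The same idea can be wrapped in a small reusable function. This is only a sketch in base R: `keep_first_permutation` is a hypothetical helper name, not part of quanteda, and it assumes the parts are always separated by a single literal separator. Instead of comparing a list of sorted vectors, it pastes the sorted parts back into one canonical key per word and checks those keys for duplicates.

```r
# Hypothetical helper (not a quanteda function): returns TRUE for the first
# occurrence of each set of parts, regardless of the order of the parts.
keep_first_permutation <- function(x, sep = "_") {
  # Split every string at the literal separator
  parts <- strsplit(x, sep, fixed = TRUE)
  # Canonical key: parts sorted alphabetically and pasted back together
  keys <- vapply(parts, function(p) paste(sort(p), collapse = sep), character(1))
  # Keep only the first string that produces each key
  !duplicated(keys)
}

words <- c("hello_text", "text_hello", "test1_test2", "test2_test1",
           "test2_test2_test2", "test2_other", "other")
keep <- keep_first_permutation(words)
words[keep]
```

Assuming the features of the dfm are exactly these strings, the same logical vector could presumably be computed from `featnames(dfmat)` and used to subset the columns, e.g. `dfmat[, keep_first_permutation(featnames(dfmat))]`. Note that this drops the columns of the removed variants rather than merging their counts.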