Remove words from a dfm that contain the same parts in a different order

I have words in a dfm like this:

library("quanteda")

Package version: 2.1.2

dfmat <- dfm(c("hello_text","text_hello","test1_test2", "test2_test1", "test2_test2_test2", "test2_other", "other"))

Here, for example, the tokens "hello_text" and "text_hello" consist of the same words in a different order. How is it possible to keep only one of these variants?

Example output

dfmat <- dfm(c("hello_text","test1_test2",  "test2_test2_test2", "test2_other", "other"))

I found this example solution, but it removes the same words.

Solution

Split each string at the underscore and sort its parts alphabetically, then use the resulting list to identify duplicates and apply that to the original vector:

words <- c("hello_text","text_hello","test1_test2", "test2_test1", "test2_test2_test2", "test2_other", "other")

# lapply() always returns a list, even when every word has the same number of parts
words_sorted <- lapply(strsplit(words, "_", fixed = TRUE), sort)

words[!duplicated(words_sorted)]

Returns:

[1] "hello_text"        "test1_test2"       "test2_test2_test2" "test2_other"      
[5] "other"
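
The same idea can be wrapped in a small reusable function. The sketch below is a base-R variant that pastes the sorted parts back into a single key per word, so that duplicated() always works on a plain character vector no matter how many parts each word has. The function name dedupe_unordered and its sep argument are hypothetical, introduced only for illustration; they are not part of quanteda.

```r
# Hypothetical helper: keep the first of any words that consist of the
# same parts in a different order. Name and arguments are illustrative.
dedupe_unordered <- function(words, sep = "_") {
  # Split each string on the separator; fixed = TRUE treats sep literally.
  parts <- strsplit(words, sep, fixed = TRUE)
  # Order-independent key: sort the parts and paste them back together.
  keys <- vapply(parts, function(p) paste(sort(p), collapse = sep),
                 character(1))
  # Keep only the first occurrence of each key, in the original order.
  words[!duplicated(keys)]
}

words <- c("hello_text", "text_hello", "test1_test2", "test2_test1",
           "test2_test2_test2", "test2_other", "other")
result <- dedupe_unordered(words)
print(result)
```

For a dfm, the helper could presumably be applied to featnames(dfmat), with the surviving names then used to select columns (for instance dfm_keep() with valuetype = "fixed"); treat that step as an assumption to verify against your quanteda version.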