r quanteda

Remove words from a dfm which are the same in a different order


Having words in a dfm like this:

library("quanteda")

Package version: 2.1.2

dfmat <- dfm(c("hello_text","text_hello","test1_test2", "test2_test1", "test2_test2_test2", "test2_other", "other"))

Here, for example, the tokens "hello_text" and "text_hello" contain the same words in a different order. How is it possible to keep only one of these options?

Example output

dfmat <- dfm(c("hello_text","test1_test2",  "test2_test2_test2", "test2_other", "other"))

I found this example solution, but it removes the same words.


Solution

  • Split the strings at the underscore and sort the parts alphabetically, then use this list to identify duplicates and apply the result to the original vector:

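    words <- c("hello_text","text_hello","test1_test2", "test2_test1", "test2_test2_test2", "test2_other", "other")
    
    # Split each word at "_" and sort its parts; lapply always returns a list,
    # so duplicated() compares whole elements even if all words have equally many parts
    words_sorted <- lapply(strsplit(words, "_", fixed = TRUE), sort)
    
    # Keep the first occurrence of each sorted combination
    words[!duplicated(words_sorted)]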
    

    Returns:

    [1] "hello_text"        "test1_test2"       "test2_test2_test2" "test2_other"      
    [5] "other"
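
The same idea can be wrapped in a small reusable helper. This is only a sketch: `drop_reordered` is a hypothetical name (it is not part of quanteda), and it assumes the parts of each word are separated by a single fixed character such as `_`. Instead of comparing list elements, it builds one order-independent key string per word and drops every word whose key has already been seen.

```r
# Hypothetical helper (not a quanteda function): keep the first occurrence of
# each combination of parts, regardless of the order of the parts.
drop_reordered <- function(words, sep = "_") {
  # Build an order-independent key per word, e.g. "text_hello" -> "hello_text"
  keys <- vapply(
    strsplit(words, sep, fixed = TRUE),
    function(parts) paste(sort(parts), collapse = sep),
    character(1)
  )
  words[!duplicated(keys)]
}

words <- c("hello_text", "text_hello", "test1_test2", "test2_test1",
           "test2_test2_test2", "test2_other", "other")
unique_words <- drop_reordered(words)
print(unique_words)

# Assumption: the filtered vector is then passed to dfm() as in the question
# (quanteda 2.1.2), e.g.:
# dfmat <- dfm(unique_words)
```

Because the keys are plain strings, `duplicated()` compares them element by element no matter how many parts each word has.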