r string matching factors

Collapse factor levels based on matching characters


I have many factor vectors in a tibble. It's a survey, so the levels are letter codes.

The survey tool records the order in which the letters were chosen at the time of the survey (from a clicker), which may or may not be useful depending on the question.

I am looking for a tidy function or process to collapse factor levels that contain the same letters. For example, "B,A" and "A,B" should both collapse to "A,B".

Likewise, "B,C,A", "C,A,B", and any other ordering of the letters A, B, C should collapse to "A,B,C". A factor level can contain up to 5 letters, so the number of orderings grows quickly.

Should I convert it to a character string and then use stringi or grepl to break it into multiple columns? I have numerous columns, so I am looking for a slick solution. Any ideas?

Here is a simple example of one such factor from my data:

library(magrittr)  # provides %>%
string <- c("E","C","A","A,B","A,B,C","B,A","C,A,B") %>% as.factor()

Solution

  • Split each value on the comma, sort the letters, and paste them back together.

    # strsplit() needs a character vector, so convert the factor first
    string %>% as.character() %>%
      strsplit(split = ",", fixed = TRUE) %>%
      lapply(sort) %>%
      sapply(paste, collapse = ",") %>%
      factor()
    # [1] E     C     A     A,B   A,B,C A,B   A,B,C
    # Levels: A A,B A,B,C C E
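
Since the question mentions numerous columns, the same split/sort/paste idea can be wrapped in a function and applied to every factor column. Below is a minimal base R sketch: `collapse_letters` and the `survey` data frame are hypothetical names made up for illustration, not anything from the original data. It relabels the levels rather than every element, relying on the fact that assigning duplicated labels to `levels()` merges those levels.

```r
# Hypothetical helper: normalise the letter order inside each factor level.
collapse_letters <- function(f, sep = ",") {
  f <- as.factor(f)
  old <- levels(f)
  # Sort the letters within each level, e.g. "C,A,B" -> "A,B,C".
  new <- vapply(
    strsplit(old, split = sep, fixed = TRUE),
    function(parts) paste(sort(parts), collapse = sep),
    character(1)
  )
  # Assigning duplicated labels merges the corresponding levels.
  levels(f) <- new
  f
}

# Made-up survey data standing in for the real tibble.
survey <- data.frame(
  q1 = factor(c("E", "C", "A", "A,B", "A,B,C", "B,A", "C,A,B")),
  q2 = factor(c("B,A", "A,B", "D", "D,C", "C,D", "A", "A")),
  id = 1:7
)

# Apply the helper to the factor columns only, leaving other columns alone.
is_fct <- vapply(survey, is.factor, logical(1))
survey[is_fct] <- lapply(survey[is_fct], collapse_letters)

levels(survey$q1)
levels(survey$q2)
```

In a tibble with dplyr (1.0 or later) loaded, `mutate(across(where(is.factor), collapse_letters))` should give the same result in one step.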