Find Similar Word Pattern for Factor Variable using R


I have a data frame, Data, with 10,000 observations and a single factor variable, Com, that has 3,000 levels. I want to find values of Com that follow a similar pattern and combine them into one level, so that I can analyse them later. The structure of Data is shown below:

> str(Data)
 'data.frame':   10000 obs. of  1 variable:
  $ Com: Factor w/ 3000 levels

Example: Frequency of Com:

> Frequency<-data.frame(Com=c("C/C++ PROGRAMMING", "C; C++ PROGRAMMING", "C++ PROGRAMMING", "C++", "PROGRAMMING C++", "C", "C PROGRAMMING", "C, C++ PROGRAMMING", "PROGRAMMING IN C; C++", "PROGRAMMINGS IN C/C++","PROGRAMMING IN C/C++", "PROGRAMMING (C, C++, CUDA)"), Freq=c(2,3,3,1,2,5,6,2,1,3,4,5))
> Frequency
                                 Com   Freq
1                  C/C++ PROGRAMMING      2
2                 C; C++ PROGRAMMING      3
3                    C++ PROGRAMMING      3
4                                C++      1
5                    PROGRAMMING C++      2
6                                  C      5
7                      C PROGRAMMING      6
8                 C, C++ PROGRAMMING      2
9              PROGRAMMING IN C; C++      1
10             PROGRAMMINGS IN C/C++      3
11              PROGRAMMING IN C/C++      4
12        PROGRAMMING (C, C++, CUDA)      5       # Just add one more situation

I want the result of Frequency to be:

> Frequency
                                 Com   Freq
1                  C/C++ PROGRAMMING     15
2                    C++ PROGRAMMING      6
3                      C PROGRAMMING     11
4         PROGRAMMING (C, C++, CUDA)      5

I could recode the levels of Com by hand to achieve this. However, the variable has 3,000 levels, and going through them one by one would take far too long.

Is there a faster way to do this? I have looked at pattern matching and replacement in R, but I still can't solve the problem.

Thanks in advance.


Solution

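  • You can do this in a few steps using regular expressions, matching the most specific pattern first (dat is assumed to be the Frequency data frame from the question):

    dat <- Frequency
    dat$Com <- as.character(dat$Com)   # a factor would reject the new labels (NA plus a warning)
    dat$Freq <- as.numeric(dat$Freq)
    dat$Com[grep('.*(C).*(C[++]).*',dat$Com)] <- 'ccplusplus'   # a "C" followed later by "C+"
    dat$Com[grep('C[++]',dat$Com)] <- 'cplusplus'               # remaining entries containing "C+"
    dat$Com[grep('C',dat$Com)] <- 'c'                           # remaining entries with a capital "C"
    tapply(dat$Freq,dat$Com,sum)
    
    # With the first 11 rows of the example this gives:
    # c ccplusplus  cplusplus 
    # 11         15          6
    # Row 12 (the CUDA entry) also matches the first pattern, so with it included ccplusplus is 20.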
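
  • With 3,000 levels, writing one regular expression per group may not scale. A possible alternative, sketched below, is to reduce each label to the set of language tokens it mentions and to group on that set. The helper name canonical_key and the list of filler words are assumptions made for this illustration, not part of the question or of any package; real data would likely need a longer filler list.

```r
# Example data from the question, with Com kept as character.
Frequency <- data.frame(
  Com = c("C/C++ PROGRAMMING", "C; C++ PROGRAMMING", "C++ PROGRAMMING", "C++",
          "PROGRAMMING C++", "C", "C PROGRAMMING", "C, C++ PROGRAMMING",
          "PROGRAMMING IN C; C++", "PROGRAMMINGS IN C/C++",
          "PROGRAMMING IN C/C++", "PROGRAMMING (C, C++, CUDA)"),
  Freq = c(2, 3, 3, 1, 2, 5, 6, 2, 1, 3, 4, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map a label to a key built from its language tokens.
canonical_key <- function(label) {
  # Split on anything that is not a letter, a digit, or "+",
  # so "C/C++" becomes the tokens "C" and "C++".
  tokens <- strsplit(toupper(label), "[^A-Z0-9+]+")[[1]]
  # Assumed filler words; extend this list for real data.
  filler <- c("PROGRAMMING", "PROGRAMMINGS", "IN")
  tokens <- tokens[nzchar(tokens) & !tokens %in% filler]
  # Radix sort is locale-independent, so the key does not depend on token order
  # in the label or on the session's collation.
  paste(sort(unique(tokens), method = "radix"), collapse = " | ")
}

Frequency$Key <- vapply(Frequency$Com, canonical_key, character(1),
                        USE.NAMES = FALSE)

# Sum the frequencies per key.
totals <- tapply(Frequency$Freq, Frequency$Key, sum)
print(totals)
# Totals per key: "C" = 11, "C++" = 6, "C | C++" = 15, "C | C++ | CUDA" = 5
```

    These totals match the result asked for in the question, including the separate CUDA group. To apply the same grouping to the original factor, one option is levels(Data$Com) <- vapply(levels(Data$Com), canonical_key, character(1)), since assigning duplicate names to levels merges those levels. It is still worth checking the resulting keys by eye, because a token-based key cannot catch misspelled language names.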