I have a data frame, Data, with 10,000 observations and one variable named Com, a factor with 3,000 levels. I'm trying to find similar patterns among the values of Com and combine them into one, so that I can analyze them later. The str of Data is as below:
> str(Data)
'data.frame': 10000 obs. of 1 variable:
$ Com: Factor w/ 3000 levels
Example: Frequency of Com:
> Frequency<-data.frame(Com=c("C/C++ PROGRAMMING", "C; C++ PROGRAMMING", "C++ PROGRAMMING", "C++", "PROGRAMMING C++", "C", "C PROGRAMMING", "C, C++ PROGRAMMING", "PROGRAMMING IN C; C++", "PROGRAMMINGS IN C/C++","PROGRAMMING IN C/C++", "PROGRAMMING (C, C++, CUDA)"), Freq=c(2,3,3,1,2,5,6,2,1,3,4,5))
> Frequency
Com Freq
1 C/C++ PROGRAMMING 2
2 C; C++ PROGRAMMING 3
3 C++ PROGRAMMING 3
4 C++ 1
5 PROGRAMMING C++ 2
6 C 5
7 C PROGRAMMING 6
8 C, C++ PROGRAMMING 2
9 PROGRAMMING IN C; C++ 1
10 PROGRAMMINGS IN C/C++ 3
11 PROGRAMMING IN C/C++ 4
12 PROGRAMMING (C, C++, CUDA) 5 # Just add one more situation
I want the result of Frequency to be:
> Frequency
Com Freq
1 C/C++ PROGRAMMING 15
2 C++ PROGRAMMING 6
3 C PROGRAMMING 11
4 PROGRAMMING (C, C++, CUDA) 5
I could recode the levels of Com by hand to do this. However, the variable has 3,000 levels, and going through them one by one would take a very long time.
So, is there another method that doesn't take so much time?
I have tried looking at Pattern matching and replacement in R, but I still can't solve the problem.
Thanks in advance.
You can do this in a few steps using regular expressions:
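# dat is the Frequency data frame from the question
dat$Com <- as.character(dat$Com)  # a factor would turn unknown labels into NA
dat$Freq <- as.numeric(dat$Freq)
# Order matters: most specific pattern first; '+' must be escaped in a regex
dat$Com[grep('C.*C\\+\\+', dat$Com)] <- 'ccplusplus'  # a C followed later by C++
dat$Com[grep('C\\+\\+', dat$Com)] <- 'cplusplus'      # remaining labels with C++
dat$Com[grep('C', dat$Com)] <- 'c'                    # remaining labels with C
tapply(dat$Freq, dat$Com, sum)
# Result for rows 1-11 of the example (row 12 also matches the first pattern):
# c ccplusplus cplusplus
# 11 15 6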
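If the CUDA row should stay in its own group, one option is to split each label into tokens and decide from the tokens rather than from a single regex. The following is only a sketch under assumptions: the helper name canonical_label, the list of filler words (PROGRAMMING, PROGRAMMINGS, IN) and the three target labels are my own choices for this example, not something given in the question, and labels with any other token are simply left unchanged.

```r
# Example data from the question, kept as character rather than factor
Frequency <- data.frame(
  Com = c("C/C++ PROGRAMMING", "C; C++ PROGRAMMING", "C++ PROGRAMMING", "C++",
          "PROGRAMMING C++", "C", "C PROGRAMMING", "C, C++ PROGRAMMING",
          "PROGRAMMING IN C; C++", "PROGRAMMINGS IN C/C++",
          "PROGRAMMING IN C/C++", "PROGRAMMING (C, C++, CUDA)"),
  Freq = c(2, 3, 3, 1, 2, 5, 6, 2, 1, 3, 4, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns a canonical label, or NA if the label
# contains tokens that are not in the known lists below.
canonical_label <- function(x) {
  # Split on anything that is not a letter, digit or '+',
  # so "C/C++" becomes the tokens "C" and "C++"
  tokens <- strsplit(toupper(x), "[^A-Z0-9+]+")
  filler <- c("", "PROGRAMMING", "PROGRAMMINGS", "IN")  # assumed noise words
  vapply(tokens, function(tok) {
    has_c   <- "C" %in% tok
    has_cpp <- "C++" %in% tok
    other   <- setdiff(tok, c(filler, "C", "C++"))
    if (length(other) > 0) return(NA_character_)  # e.g. CUDA: leave untouched
    if (has_c && has_cpp) {
      "C/C++ PROGRAMMING"
    } else if (has_cpp) {
      "C++ PROGRAMMING"
    } else if (has_c) {
      "C PROGRAMMING"
    } else {
      NA_character_
    }
  }, character(1))
}

new_label <- canonical_label(Frequency$Com)
# Fall back to the original label wherever no canonical label was found
Frequency$Group <- ifelse(is.na(new_label), Frequency$Com, new_label)

# Sum the frequencies within each merged group
result <- aggregate(Freq ~ Group, data = Frequency, FUN = sum)
result
```

On the real data it should be enough to run the helper once over levels(Data$Com) (3,000 strings) instead of over all 10,000 rows, and then review whatever comes back as NA by hand.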