If I have a vector:
x <- c("ajjss","acdjfkj","auyjyjjksjj")
and do:
y <- x[grep("jj",x)]
table(y)
I get:
y
      ajjss auyjyjjksjj
          1           1
However, the second string, "auyjyjjksjj", contains the substring "jj" twice, so it should count twice. How can I change this from a true/false test for a match to an actual count of how often "jj" occurs in each string?
It would also be great to calculate, for each string, the frequency of the substring divided by the string's length.
Thanks in advance.
I solved this using gregexpr():
x <- c("ajjss","acdjfkj","auyjyjjksjj")
# gregexpr() returns -1 for a string with no match, so check before counting
freq <- sapply(gregexpr("jj", x), function(m) if (m[1] != -1) length(m) else 0)
df <- data.frame(x, freq)
df
#            x freq
#1       ajjss    1
#2     acdjfkj    0
#3 auyjyjjksjj    2
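For anyone wondering why the -1 check is needed: gregexpr() returns one vector of match start positions per input string, and a single -1 when nothing matches, so length() on its own would report 1 for a string with no match. A quick sketch of the raw result (the names m and starts are just for illustration):

```r
x <- c("ajjss", "acdjfkj", "auyjyjjksjj")
m <- gregexpr("jj", x)

# as.integer() drops the match.length attributes so only start positions remain
starts <- lapply(m, as.integer)
starts[[1]]  # 2
starts[[2]]  # -1, meaning no match
starts[[3]]  # 6 10
```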
For the last part of the question, dividing the frequency by the string length:
df$rate <- df$freq / nchar(as.character(df$x))
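In R versions before 4.0.0 it is necessary to convert df$x back to a character vector, because data.frame(x, freq) converts strings to factors unless you specify stringsAsFactors = FALSE. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so the as.character() call is harmless but no longer required.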
df
#            x freq      rate
#1       ajjss    1 0.2000000
#2     acdjfkj    0 0.0000000
#3 auyjyjjksjj    2 0.1818182
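As a possible alternative, here is a minimal sketch that avoids the explicit -1 check by counting the extracted matches with regmatches() and lengths(). The helper name count_substring is made up for this example and is not part of base R. It assumes the pattern should be matched literally (fixed = TRUE) and that lengths() is available (R 3.2.0 or later).

```r
# count_substring() is a hypothetical helper name, not a base R function
count_substring <- function(strings, pattern) {
  # fixed = TRUE treats `pattern` as a literal string, not a regular expression
  matches <- regmatches(strings, gregexpr(pattern, strings, fixed = TRUE))
  # Strings without a match yield character(0), so lengths() gives 0 for them
  lengths(matches)
}

x <- c("ajjss", "acdjfkj", "auyjyjjksjj")
df <- data.frame(x, freq = count_substring(x, "jj"), stringsAsFactors = FALSE)
df$rate <- df$freq / nchar(df$x)
df
```

Note that gregexpr() reports non-overlapping matches, so with either approach a string such as "jjj" counts "jj" once, not twice.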