I have a dataframe with text and I want to extract the character-level bigrams (n = 2), e.g. "st", "ac", "ck", for each text in R.
I also want to count the frequency of each character-level bigram in the text.
Data:
df$text
[1] "hy my name is"
[2] "stackover flow is great"
[3] "how are you"
I'm not quite sure of your expected output here. I would have thought that the bigrams for "stack" would be "st", "ta", "ac", and "ck", since this captures each consecutive pair.
For example, if you wanted to know how many instances of the bigram "th" the word "brothers" contains, but you split it only into the non-overlapping pairs "br", "ot", "he" and "rs", you would get the answer 0, which is wrong.
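To make that concrete, here is a quick base R sketch (using substring(), just to illustrate the point) comparing the two ways of splitting "brothers":
# All consecutive (overlapping) bigrams of "brothers"
substring("brothers", 1:7, 2:8)
#> [1] "br" "ro" "ot" "th" "he" "er" "rs"
sum(substring("brothers", 1:7, 2:8) == "th")  # 1 -- "th" is found
# Non-overlapping pairs only
substring("brothers", c(1, 3, 5, 7), c(2, 4, 6, 8))
#> [1] "br" "ot" "he" "rs"
# "th" never appears here, so its count would be 0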
You can build up a single function that gets all the bigrams from a couple of small helper functions, like this:
# This function takes a vector of single characters and creates all the bigrams
# within that vector. For example "s", "t", "a", "c", "k" becomes
# "st", "ta", "ac", and "ck"
pair_chars <- function(char_vec) {
  all_pairs <- paste0(char_vec[-length(char_vec)], char_vec[-1])
  return(as.vector(all_pairs[nchar(all_pairs) == 2]))
}
# This function splits a single word into a character vector and gets its bigrams
word_bigrams <- function(words) {
  unlist(lapply(strsplit(words, ""), pair_chars))
}
# This function splits a string or vector of strings into words and gets their bigrams
string_bigrams <- function(strings) {
  unlist(lapply(strsplit(strings, " "), word_bigrams))
}
So now we can test this on your example:
df <- data.frame(text = c("hy my name is", "stackover flow is great",
                          "how are you"), stringsAsFactors = FALSE)
string_bigrams(df$text)
#> [1] "hy" "my" "na" "am" "me" "is" "st" "ta" "ac" "ck" "ko" "ov" "ve" "er" "fl"
#> [16] "lo" "ow" "is" "gr" "re" "ea" "at" "ho" "ow" "ar" "re" "yo" "ou"
If you want to count occurrences, you can just use table():
table(string_bigrams(df$text))
#> ac am ar at ck ea er fl gr ho hy is ko lo me my na ou ov ow re st ta ve yo
#> 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1 1
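If you want the counts sorted, or separately for each text rather than pooled over the whole column, the same functions work (a small extension, not part of the output above):
# Most frequent bigrams first
sort(table(string_bigrams(df$text)), decreasing = TRUE)
# Bigram counts per text
lapply(df$text, function(x) table(string_bigrams(x)))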
However, if you are going to be doing a fair bit of text mining, you should look into dedicated R packages like stringi, stringr, tm and quanteda that help with these basic tasks.
For example, all of the base R functions I wrote above can be replaced using the quanteda package like this:
library(quanteda)
char_ngrams(unlist(tokens(df$text, "character")), concatenator = "")
#> [1] "hy" "ym" "my" "yn" "na" "am" "me" "ei" "is" "ss" "st" "ta" "ac" "ck"
#> [15] "ko" "ov" "ve" "er" "rf" "fl" "lo" "ow" "wi" "is" "sg" "gr" "re" "ea"
#> [29] "at" "th" "ho" "ow" "wa" "ar" "re" "ey" "yo" "ou"
Created on 2020-06-13 by the reprex package (v0.3.0)
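If you want counts from the quanteda output as well, the same table() trick applies (a small addition on my part, not part of the reprex above):
table(char_ngrams(unlist(tokens(df$text, "character")), concatenator = ""))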