Does anyone here have experience identifying the most common phrases (3 to 7 consecutive words)? I understand that most frequency analysis focuses on the most frequent/common words (along with plotting a WordCloud) rather than on phrases.
# Assuming a particular column in a data frame (df) with n rows that is all text data,
# as I'm not able to provide sample data (using dput() on a large text file won't
# be feasible here).
library(tm)

Text <- df$Text_Column
docs <- Corpus(VectorSource(Text))
...
Thanks in advance!
You have several options to do this in R. Let's grab some data first. I use the books by Jane Austen from the janeaustenr package and do some cleaning to get each paragraph into a separate row:
library(janeaustenr)
library(tidyverse)
# increment the paragraph counter at each empty line that ends a block of text,
# then collapse each paragraph into a single row
books <- austen_books() %>%
  mutate(paragraph = cumsum(text == "" & lag(text) != "")) %>%
  group_by(paragraph) %>%
  summarise(book = head(book, 1),
            text = trimws(paste(text, collapse = " ")),
            .groups = "drop")
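If you want a quick sanity check that the reshaping worked before tokenising, counting paragraphs per book is enough (the column names here are just the ones created above):

# number of paragraphs per book after the reshaping above
books %>%
  count(book)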
With tidytext:
library(tidytext)
# using multiple values for n is not directly implemented in tidytext,
# so map over 3:7 and bind the results
map_df(3L:7L, ~ unnest_tokens(books, ngram, text, token = "ngrams", n = .x)) %>%
  count(ngram) %>%
  filter(!is.na(ngram)) %>%
  slice_max(n, n = 10)
#> # A tibble: 10 × 2
#> ngram n
#> <chr> <int>
#> 1 i am sure 415
#> 2 i do not 412
#> 3 she could not 328
#> 4 it would be 258
#> 5 in the world 247
#> 6 as soon as 236
#> 7 a great deal 214
#> 8 would have been 211
#> 9 she had been 203
#> 10 it was a 202
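To run the same tidytext approach on your own data frame from the question, a minimal sketch (assuming df really has a character column called Text_Column, as in your snippet; swap in your actual names):

library(tidytext)
library(tidyverse)

# hypothetical: df / Text_Column stand in for your own data
map_df(3L:7L, ~ unnest_tokens(df, ngram, Text_Column, token = "ngrams", n = .x)) %>%
  count(ngram) %>%
  filter(!is.na(ngram)) %>%   # ngrams are NA for rows shorter than n words
  slice_max(n, n = 10)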
With quanteda:
library(quanteda)
books %>%
  corpus(docid_field = "paragraph",
         text_field = "text") %>%
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE) %>%
  tokens_ngrams(n = 3L:7L) %>%
  dfm() %>%
  topfeatures(n = 10) %>%
  enframe()
#> # A tibble: 10 × 2
#> name value
#> <chr> <dbl>
#> 1 i_am_sure 415
#> 2 i_do_not 412
#> 3 she_could_not 328
#> 4 it_would_be 258
#> 5 in_the_world 247
#> 6 as_soon_as 236
#> 7 a_great_deal 214
#> 8 would_have_been 211
#> 9 she_had_been 203
#> 10 it_was_a 202
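The quanteda version can be pointed at your column directly, without building a tm corpus first. Again, df and Text_Column are placeholders for your data, so adjust as needed:

library(quanteda)
library(tibble)

# hypothetical column name taken from the question
corpus(df$Text_Column) %>%
  tokens(remove_punct = TRUE, remove_symbols = TRUE) %>%
  tokens_ngrams(n = 3L:7L) %>%
  dfm() %>%
  topfeatures(n = 10) %>%
  enframe(name = "ngram", value = "count")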
With text2vec:
library(text2vec)
itoken(books$text, tolower, word_tokenizer) %>%
  create_vocabulary(ngram = c(3L, 7L), sep_ngram = " ") %>%
  filter(str_detect(term, "[[:alpha:]]")) %>% # keep terms with at least one alphabetic character
  slice_max(term_count, n = 10)
#> Number of docs: 10293
#> 0 stopwords: ...
#> ngram_min = 3; ngram_max = 7
#> Vocabulary:
#> term term_count doc_count
#> 1: i am sure 415 384
#> 2: i do not 412 363
#> 3: she could not 328 288
#> 4: it would be 258 233
#> 5: in the world 247 234
#> 6: as soon as 236 233
#> 7: a great deal 214 209
#> 8: would have been 211 192
#> 9: she had been 203 179
#> 10: it was a 202 194
Created on 2022-08-03 by the reprex package (v2.0.1)
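Since you mentioned word clouds: the counts from any of the approaches above can be plotted like ordinary word frequencies, just with phrases as the labels. A sketch with ggplot2, plus an optional phrase cloud (this assumes the wordcloud package is installed; it handles multi-word strings the same as single words):

library(tidytext)
library(tidyverse)

top_phrases <- map_df(3L:7L, ~ unnest_tokens(books, ngram, text, token = "ngrams", n = .x)) %>%
  count(ngram) %>%
  filter(!is.na(ngram)) %>%
  slice_max(n, n = 20)

# horizontal bar chart of the most common phrases
ggplot(top_phrases, aes(n, reorder(ngram, n))) +
  geom_col() +
  labs(x = "count", y = NULL)

# or a word cloud of phrases
library(wordcloud)
wordcloud(words = top_phrases$ngram, freq = top_phrases$n, scale = c(2, 0.5))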