I have a data frame in R with a column that I need to run some basic text analysis on. I can do this by adapting the code from this source as needed. However, I now need to run the same analysis per group of data. I've included the dput of a small sample here.
structure(list(Pad.Name = c("MISSOURI W", "MISSOURI W", "MISSOURI W",
"LEE", "LEE", "LEE"), Message = c("pump maint", "PUMP MAINT", "Pump Maintenance",
"waiting on wireline",
"seating the ball", "Waiting on wireline")), row.names = 11:16, class = "data.frame")
I want to group by the variable Pad.Name. I've tried the corpus_group function from the quanteda package, as well as the corpus function from the same package with the parameters set as docid_field = dat$Pad.Name and text_field = dat$Message. Neither of these seems to work.
My desired output is the most frequent words, say the top 10, together with a count of each word, for each unique Pad.Name. Something like the following, though the actual counts would be whatever the data gives:
edit: the table option never seems to work here, so here is a dput and data frame of my desired output
structure(list(Pad.Name = c("MISSOURI W", "MISSOURI W", "LEE",
"LEE"), Word = c("pump", "maint", "waiting", "wireline"), Count = c(3,
2, 2, 2)), class = "data.frame", row.names = c(NA, -4L))
output <- data.frame(Pad.Name = c("MISSOURI W", "MISSOURI W", "LEE", "LEE"), Word = c("pump", "maint", "waiting", "wireline"), Count = c(3,2,2,2))
Would dplyr and tidytext do?
library(tidytext)
library(dplyr)
as_tibble(data) %>%
  # split to words
  unnest_tokens(word, Message) %>%
  # filter out stopwords
  anti_join(get_stopwords()) %>%
  # count by (Pad.Name, word) groups
  count(Pad.Name, word, name = "Count", sort = TRUE) %>%
  # output is sorted by Count, no grouping, keep top-4
  slice_head(n = 4) %>%
  arrange(Pad.Name, desc(Count))
#> Joining, by = "word"
#> # A tibble: 4 × 3
#>   Pad.Name   word     Count
#>   <chr>      <chr>    <int>
#> 1 LEE        waiting      2
#> 2 LEE        wireline     2
#> 3 MISSOURI W pump         3
#> 4 MISSOURI W maint        2
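Note that slice_head(n = 4) keeps the four most frequent rows overall, which happens to match the desired output for this sample. If you want the top 10 within each Pad.Name, as the question asks, one option is to group before slicing. The following is a minimal sketch, assuming dplyr 1.0.0 or later (for slice_max) and that the stopwords package is installed for get_stopwords(); the names dat, top_n_words and top_words are my own, not from the question.

```r
library(dplyr)
library(tidytext)

# Sample data from the question (row names omitted for brevity)
dat <- data.frame(
  Pad.Name = c("MISSOURI W", "MISSOURI W", "MISSOURI W", "LEE", "LEE", "LEE"),
  Message = c(
    "pump maint", "PUMP MAINT", "Pump Maintenance",
    "waiting on wireline", "seating the ball", "Waiting on wireline"
  )
)

top_n_words <- 10  # how many words to keep per Pad.Name

top_words <- dat %>%
  # one row per word; unnest_tokens lowercases by default
  unnest_tokens(word, Message) %>%
  # drop stopwords such as "on" and "the"
  anti_join(get_stopwords(), by = "word") %>%
  # count each word within each Pad.Name
  count(Pad.Name, word, name = "Count") %>%
  # keep the most frequent words per group
  group_by(Pad.Name) %>%
  slice_max(Count, n = top_n_words, with_ties = FALSE) %>%
  ungroup()

top_words
```

with_ties = FALSE caps each group at exactly top_n_words rows; drop it if you would rather keep every word tied at the cutoff.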
Input:
data <- structure(list(Pad.Name = c(
"MISSOURI W", "MISSOURI W", "MISSOURI W",
"LEE", "LEE", "LEE"
), Message = c(
"pump maint", "PUMP MAINT", "Pump Maintenance",
"waiting on wireline",
"seating the ball", "Waiting on wireline"
)), row.names = 11:16, class = "data.frame")
Created on 2023-01-26 with reprex v2.0.2
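If you would rather stay with quanteda, as in the question, the likely problem is that text_field expects a column name (or index) rather than the column itself. A rough sketch of how that could look, assuming a reasonably recent quanteda; dat, corp, toks, dfmat and top_by_pad are names I made up here, and I am leaving docid_field alone so that Pad.Name is simply carried along as a document variable:

```r
library(quanteda)

# Sample data from the question (row names omitted for brevity)
dat <- data.frame(
  Pad.Name = c("MISSOURI W", "MISSOURI W", "MISSOURI W", "LEE", "LEE", "LEE"),
  Message = c(
    "pump maint", "PUMP MAINT", "Pump Maintenance",
    "waiting on wireline", "seating the ball", "Waiting on wireline"
  )
)

# text_field takes the column *name*; Pad.Name is kept as a docvar
corp <- corpus(dat, text_field = "Message")

# tokenize and drop English stopwords (matching is case-insensitive by default)
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))

# document-feature matrix; dfm() lowercases features by default
dfmat <- dfm(toks)

# top 10 features per Pad.Name, returned as a named list of named vectors
top_by_pad <- topfeatures(dfmat, n = 10, groups = docvars(dfmat, "Pad.Name"))
top_by_pad
```

The result is a list with one element per Pad.Name rather than a data frame, so it would still need reshaping to match the desired output exactly.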