r dataframe grouping text-mining corpus

Grouping text data in a corpus by a data frame variable


I have a data frame in R with a column that I need to run basic text analysis on. I can do this by adapting the code from this source. However, I now need to run the same analysis for groups of data. I've included the dput of a small sample here.

structure(list(Pad.Name = c("MISSOURI W", "MISSOURI W", "MISSOURI W", 
"LEE", "LEE", "LEE"), Message = c("pump maint", "PUMP MAINT", "Pump Maintenance", 
"waiting on wireline", 
"seating the ball", "Waiting on wireline")), row.names = 11:16, class = "data.frame")

I want to group by the variable Pad.Name. I've tried the corpus_group function from the quanteda package, as well as the corpus function from the same package with the parameters docid_field = dat$Pad.Name and text_field = dat$Message. Neither approach works.
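For comparison, a minimal quanteda sketch of this kind of grouping, assuming quanteda version 3 or later is installed. For a data frame, `corpus()` takes `text_field` as the name of the text column (a string such as "Message") instead of the column itself, and `Pad.Name` is then kept as a document variable to group on. The object names `corp`, `toks`, `dfmat`, and `top` are illustrative only, and `n = 2` is chosen to suit the small sample.

```r
library(quanteda)

dat <- data.frame(
  Pad.Name = c("MISSOURI W", "MISSOURI W", "MISSOURI W", "LEE", "LEE", "LEE"),
  Message = c("pump maint", "PUMP MAINT", "Pump Maintenance",
              "waiting on wireline", "seating the ball", "Waiting on wireline")
)

# text_field is the column *name*; Pad.Name is carried along as a docvar
corp <- corpus(dat, text_field = "Message")

# tokenize and drop English stopwords (matching is case-insensitive by default)
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, pattern = stopwords("en"))

# dfm() lowercases features by default, so "PUMP" and "pump" are merged
dfmat <- dfm(toks)

# most frequent features per pad, returned as a named list of named vectors
top <- topfeatures(dfmat, n = 2, groups = docvars(dfmat, "Pad.Name"))
print(top)
```

Each list element is named after a pad and holds that pad's word counts, which can be reshaped into a data frame if needed.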

My desired output is the most frequent words (say, the top 10) and a count of those words, for each unique Pad.Name. Something like the following, with whatever the true counts turn out to be:

edit: the table option never seems to work here, so here is a dput and data frame of my desired output

structure(list(Pad.Name = c("MISSOURI W", "MISSOURI W", "LEE", 
"LEE"), Word = c("pump", "maint", "waiting", "wireline"), Count = c(3, 
2, 2, 2)), class = "data.frame", row.names = c(NA, -4L))

output <- data.frame(Pad.Name = c("MISSOURI W", "MISSOURI W", "LEE", "LEE"), Word = c("pump", "maint", "waiting", "wireline"), Count = c(3,2,2,2))

Solution

  • Would dplyr and tidytext do?

    library(tidytext)
    library(dplyr)
    
    as_tibble(data) %>% 
      # split to words
      unnest_tokens(word,Message) %>% 
      # filter out stopwords
      anti_join(get_stopwords()) %>% 
      # count by (Pad.Name, word) groups 
      count(Pad.Name, word, name = "Count", sort = TRUE) %>%
      # rows are sorted by Count across all pads (no per-pad ranking); keep the top 4 overall
      slice_head(n = 4) %>% 
      arrange(Pad.Name, desc(Count))
    #> Joining, by = "word"
    #> # A tibble: 4 × 3
    #>   Pad.Name   word     Count
    #>   <chr>      <chr>    <int>
    #> 1 LEE        waiting      2
    #> 2 LEE        wireline     2
    #> 3 MISSOURI W pump         3
    #> 4 MISSOURI W maint        2
    

    Input:

    data <- structure(list(Pad.Name = c(
      "MISSOURI W", "MISSOURI W", "MISSOURI W",
      "LEE", "LEE", "LEE"
    ), Message = c(
      "pump maint", "PUMP MAINT", "Pump Maintenance",
      "waiting on wireline",
      "seating the ball", "Waiting on wireline"
    )), row.names = 11:16, class = "data.frame")
    
    

    Created on 2023-01-26 with reprex v2.0.2
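The pipeline above keeps the top rows overall rather than the top rows for each Pad.Name. A minimal sketch of a per-group variant follows, assuming dplyr 1.0.0 or later for `slice_max()`. The name `top_n_per_pad` is hypothetical and is set to 2 to suit the six-row sample; the question asks for 10.

```r
library(dplyr)
library(tidytext)

data <- data.frame(
  Pad.Name = c("MISSOURI W", "MISSOURI W", "MISSOURI W", "LEE", "LEE", "LEE"),
  Message = c("pump maint", "PUMP MAINT", "Pump Maintenance",
              "waiting on wireline", "seating the ball", "Waiting on wireline")
)

# hypothetical setting: number of words to keep per pad (use 10 on real data)
top_n_per_pad <- 2

top_words <- data %>%
  # one row per word; unnest_tokens() lowercases by default
  unnest_tokens(word, Message) %>%
  # drop stopwords such as "on" and "the"
  anti_join(get_stopwords(), by = "word") %>%
  # count each word within each pad
  count(Pad.Name, word, name = "Count") %>%
  # rank within each pad and keep the most frequent words
  group_by(Pad.Name) %>%
  slice_max(Count, n = top_n_per_pad, with_ties = FALSE) %>%
  ungroup()

print(top_words)
# For the sample: LEE -> waiting (2), wireline (2); MISSOURI W -> pump (3), maint (2)
```

With `with_ties = FALSE` each pad returns at most `top_n_per_pad` rows; setting it to `TRUE` would instead keep every word tied at the cutoff count.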