
Split Player and Chat from Chat Log (text-mining)


I have a chat log which includes 4 players (A, B, C, D) and their chats in one row in my data frame (across many groups). I want to split each phrase into its own row and identify the speaker of that phrase in a separate column.

I have tried several approaches with the following packages, without success: psych, dplyr, splitstackshape, tidytext, stringr, tidyr.

The data frame is not a .txt document. Does it need to be?

For example, this is what the chat log looks like. All of it is in one row of my dataset.

 [1] " *** D has joined the chat ***"
 [2] " *** B has joined the chat ***"
 [3] " *** A has joined the chat ***"
 [4] "D: hi"
 [5] "B: hello!"
 [6] "A: Hi!"
 [7] "D: i think oxygen is most important"
 [8] "A: I do too"
 [9] " *** C has joined the chat ***"
[10] "B: agreed, that was my #1"
[11] "A: I didnt at first but then on second guess"
[12] "A: oxygen then water"
[13] "C: hi hi"

I want the following columns, with each phrase in its own row:

Player ID   Phrase
A           hi!
B           hello!

Eventually I want to use this to count the number of words and characters per player.


Solution

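  • library(dplyr)
    library(tidyr)
    
    d %>%
      t() %>%               # 1 x 13 matrix -> 13 x 1, one chat line per row
      as.data.frame() %>%   # the single unnamed column is called V1 by default
      # drop the "*** X has joined the chat ***" notices
      filter(!grepl("***", V1, fixed = TRUE)) %>%
      # split at the first ": " only, so a phrase may itself contain ": "
      separate(V1, into = c("PlayerID", "Phrase"), sep = ": ", extra = "merge") %>%
      mutate(Count = nchar(Phrase))   # characters per phrase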
    

    result:

    #>   PlayerID                                    Phrase Count
    #> 1        D                                        hi     2
    #> 2        B                                    hello!     6
    #> 3        A                                       Hi!     3
    #> 4        D          i think oxygen is most important    32
    #> 5        A                                  I do too     8
    #> 6        B                    agreed, that was my #1    22
    #> 7        A I didnt at first but then on second guess    41
    #> 8        A                         oxygen then water    17
    #> 9        C                                     hi hi     5
    

    You could add this to the end of the dplyr chain to count the total number of characters per player:

    group_by(PlayerID) %>%
    summarize(Total = sum(Count))
    
    #>   PlayerID Total
    #>   <chr>    <int>
    #> 1 A           69
    #> 2 B           28
    #> 3 C            5
    #> 4 D           34
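
The question also asks for word counts, which the chain above does not compute. A minimal sketch, assuming a word is any run of non-whitespace characters (so "#1" counts as one word). The sample chat is rebuilt here as a plain character vector; `chat` and `word_totals` are names introduced for this sketch only:

```r
library(dplyr)
library(tidyr)

# Sample chat lines from this answer, as a plain character vector.
# `chat` is a name introduced for this sketch.
chat <- c(" *** D has joined the chat ***", " *** B has joined the chat ***",
          " *** A has joined the chat ***", "D: hi", "B: hello!", "A: Hi!",
          "D: i think oxygen is most important", "A: I do too",
          " *** C has joined the chat ***", "B: agreed, that was my #1",
          "A: I didnt at first but then on second guess",
          "A: oxygen then water", "C: hi hi")

word_totals <- data.frame(V1 = chat, stringsAsFactors = FALSE) %>%
  # drop the join notices
  filter(!grepl("***", V1, fixed = TRUE)) %>%
  # split speaker from phrase at the first ": "
  separate(V1, into = c("PlayerID", "Phrase"), sep = ": ", extra = "merge") %>%
  # Assumption: a "word" is any run of non-whitespace characters
  mutate(Words = lengths(strsplit(trimws(Phrase), "\\s+"))) %>%
  group_by(PlayerID) %>%
  summarize(TotalWords = sum(Words))

word_totals
```

For this sample, that gives 16 words for A, 6 for B, 2 for C and 7 for D. A stricter definition of a word (for example, ignoring punctuation-only tokens) would need a different pattern.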
    

    data:

    d <- structure(c(" *** D has joined the chat ***", " *** B has joined the chat ***", 
                     " *** A has joined the chat ***", "D: hi", "B: hello!", "A: Hi!", 
                     "D: i think oxygen is most important", "A: I do too", " *** C has joined the chat ***", 
                     "B: agreed, that was my #1", "A: I didnt at first but then on second guess", 
                     "A: oxygen then water", "C: hi hi"), .Dim = c(1L, 13L))
    
    Created on 2022-05-25 by the reprex package (v2.0.1)
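
The question mentions many groups, while the example above handles a single chat. How the real data stores each group's chat is not shown, so the following is only a sketch under a stated assumption: each group's chat is available as a character vector of lines, collected in a named list. The list `chats`, the group names `g1` and `g2`, and the helper `parse_chat()` are all hypothetical and exist only for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical input: one character vector of chat lines per group.
# The lines are borrowed from the sample chat; the grouping is made up.
chats <- list(
  g1 = c(" *** D has joined the chat ***", "D: hi", "B: hello!"),
  g2 = c(" *** A has joined the chat ***", "A: Hi!", "A: I do too")
)

# Hypothetical helper: turn one group's lines into PlayerID / Phrase / Count
parse_chat <- function(lines) {
  data.frame(V1 = lines, stringsAsFactors = FALSE) %>%
    filter(!grepl("***", V1, fixed = TRUE)) %>%
    separate(V1, into = c("PlayerID", "Phrase"), sep = ": ", extra = "merge") %>%
    mutate(Count = nchar(Phrase))
}

# Parse every group and stack the results; .id stores the list names in `Group`
all_chats <- bind_rows(lapply(chats, parse_chat), .id = "Group")

# Characters per player within each group
totals <- all_chats %>%
  group_by(Group, PlayerID) %>%
  summarize(Total = sum(Count), .groups = "drop")

totals
```

If each group's chat is instead stored across the columns of one data frame row, as `d` is here, each row would first need to be converted to a character vector before being passed to a helper like this.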