Split Player and Chat from Chat Log (text-mining)

I have a chat log involving 4 players (A, B, C, D). The whole chat sits in one row of my data frame, and there are many groups. I want to split each phrase into its own row and identify the speaker of that phrase in a separate column.

I have tried many things with the following packages, without success: psych, dplyr, splitstackshape, tidytext, stringr, tidyr.

The data frame is not a text document, but I'm wondering whether it needs to be?

For example, this is what the chat log looks like. All of this is in one row of my dataset.

 [1] " *** D has joined the chat ***"
 [2] " *** B has joined the chat ***"
 [3] " *** A has joined the chat ***"
 [4] "D: hi"
 [5] "B: hello!"
 [6] "A: Hi!"
 [7] "D: i think oxygen is most important"
 [8] "A: I do too"
 [9] " *** C has joined the chat ***"
[10] "B: agreed, that was my #1"
[11] "A: I didnt at first but then on second guess"
[12] "A: oxygen then water"
[13] "C: hi hi"

I want the following output, with these two columns and one row per phrase:

Player ID	Phrase
A	hi!
B	hello!

Eventually I want to use this to count the number of words and characters per player.

Solution

library(dplyr)
library(tidyr)

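d %>%
  t() %>%                                        # 1 x 13 matrix -> 13 x 1, one chat line per row
  as.data.frame() %>%                            # the single column is named V1 by default
  filter(!grepl("***", V1, fixed = TRUE)) %>%    # drop the "has joined the chat" notices
  # split at the first ": " only, so a later ": " stays inside the phrase
  separate(V1, into = c("PlayerID", "Phrase"), sep = ": ", extra = "merge") %>%
  mutate(Count = nchar(Phrase))                  # characters per phrase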

result:

#>   PlayerID                                    Phrase Count
#> 1        D                                        hi     2
#> 2        B                                    hello!     6
#> 3        A                                       Hi!     3
#> 4        D          i think oxygen is most important    32
#> 5        A                                  I do too     8
#> 6        B                    agreed, that was my #1    22
#> 7        A I didnt at first but then on second guess    41
#> 8        A                         oxygen then water    17
#> 9        C                                     hi hi     5

You could add this to the end of the dplyr chain (after another %>%) to count the total number of characters per player:

group_by(PlayerID) %>%
summarize(Total = sum(Count))

#>   PlayerID Total
#>   <chr>    <int>
#> 1 A           69
#> 2 B           28
#> 3 C            5
#> 4 D           34
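
The question also asks about words per player, which the chain above does not cover. One possible sketch, assuming a "word" is simply a run of non-whitespace characters, uses stringr::str_count(). The names chat, Words, Chars and per_player are made up for this illustration, and the vector is a shortened copy of the sample chat so the block runs on its own:

```r
library(dplyr)
library(tidyr)
library(stringr)

# A few lines copied from the sample chat log (shortened for illustration)
chat <- c(" *** D has joined the chat ***", "D: hi", "B: hello!", "A: Hi!",
          "D: i think oxygen is most important", "A: I do too",
          "B: agreed, that was my #1", "C: hi hi")

per_player <- data.frame(V1 = chat) %>%
  filter(!grepl("***", V1, fixed = TRUE)) %>%        # drop join notices
  separate(V1, into = c("PlayerID", "Phrase"),
           sep = ": ", extra = "merge") %>%          # split at the first ": " only
  mutate(Words = str_count(Phrase, "\\S+"),          # assumption: word = run of non-whitespace
         Chars = nchar(Phrase)) %>%
  group_by(PlayerID) %>%
  summarize(Words = sum(Words), Chars = sum(Chars))  # totals per player

per_player
```

With this definition, tokens such as "#1" or "agreed," each count as one word. If punctuation should be handled differently, a tokenizer such as tidytext::unnest_tokens() may be a better fit.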

data:

d <- structure(c(" *** D has joined the chat ***", " *** B has joined the chat ***", 
                 " *** A has joined the chat ***", "D: hi", "B: hello!", "A: Hi!", 
                 "D: i think oxygen is most important", "A: I do too", " *** C has joined the chat ***", 
                 "B: agreed, that was my #1", "A: I didnt at first but then on second guess", 
                 "A: oxygen then water", "C: hi hi"), .Dim = c(1L, 13L))

Created on 2022-05-25 by the reprex package (v2.0.1)
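
The sample data holds a single chat, while the question mentions many groups. The real layout is not shown, so the following is only a sketch under an assumed layout: one row per group, with the whole log stored as one newline-separated string. The data frame logs, its columns Group and Chat, and the result name long are all hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical layout: one row per group, entire log in a single string
logs <- data.frame(
  Group = c("g1", "g2"),
  Chat  = c(" *** D has joined the chat ***\nD: hi\nB: hello!",
            "A: Hi!\nC: hi hi"),
  stringsAsFactors = FALSE
)

long <- logs %>%
  separate_rows(Chat, sep = "\n") %>%                # one chat line per row, Group is repeated
  filter(!grepl("***", Chat, fixed = TRUE)) %>%      # drop join notices
  separate(Chat, into = c("PlayerID", "Phrase"),
           sep = ": ", extra = "merge")              # split at the first ": " only

long
```

From here, group_by(Group, PlayerID) followed by summarize() would give per-player totals within each group. If the lines are separated by something other than "\n", the sep argument of separate_rows() would need to change accordingly.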