I have a chat log from 4 players (A, B, C, D), and the whole conversation is stored in a single row of my data frame (there is one such row for each of many groups). I want to split each phrase into its own row and identify the speaker of that phrase in a separate column.
I have tried several approaches using the following packages, without success: psych, dplyr, splitstackshape, tidytext, stringr, tidyr.
The data is in a data frame, not a .txt document, but I'm wondering whether it needs to be one?
For example, this is what the chat log looks like. All of this is in one row of my dataset.
[1] " *** D has joined the chat ***"
[2] " *** B has joined the chat ***"
[3] " *** A has joined the chat ***"
[4] "D: hi"
[5] "B: hello!"
[6] "A: Hi!"
[7] "D: i think oxygen is most important"
[8] "A: I do too"
[9] " *** C has joined the chat ***"
[10] "B: agreed, that was my #1"
[11] "A: I didnt at first but then on second guess"
[12] "A: oxygen then water"
[13] "C: hi hi"
I want the following columns, where each row is a new phrase:
Player ID | Phrase |
---|---|
A | Hi! |
B | hello! |
Eventually I want to use this to count the number of words/characters per player.
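library(dplyr)
library(tidyr)

d %>%
  t() %>%                                      # 1 x 13 matrix -> 13 x 1: one message per row
  as.data.frame() %>%                          # the single column is named V1 by default
  filter(!grepl("***", V1, fixed = TRUE)) %>%  # drop the "has joined the chat" notices
  separate(V1, into = c("PlayerID", "Phrase"), sep = ": ") %>%  # speaker vs. text
  mutate(Count = nchar(Phrase))                # characters per phrase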
result:
#>   PlayerID                                    Phrase Count
#> 1        D                                        hi     2
#> 2        B                                    hello!     6
#> 3        A                                       Hi!     3
#> 4        D          i think oxygen is most important    32
#> 5        A                                  I do too     8
#> 6        B                    agreed, that was my #1    22
#> 7        A I didnt at first but then on second guess    41
#> 8        A                         oxygen then water    17
#> 9        C                                     hi hi     5
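One thing to watch for: if a phrase itself contains ": ", separate() will split it into more than two pieces and drop the extras with a warning. A minimal sketch of one way around that, using the extra = "merge" argument of tidyr::separate(); the two messages in chat are made up for illustration and are not from the chat log above:

```r
library(dplyr)
library(tidyr)

# Hypothetical messages; the second one has ": " inside the phrase itself.
chat <- data.frame(V1 = c("D: hi", "B: my ranking: oxygen, then water"))

out <- chat %>%
  # extra = "merge" splits only at the first ": ", so the rest of the
  # message stays together in Phrase instead of being dropped.
  separate(V1, into = c("PlayerID", "Phrase"), sep = ": ", extra = "merge")

out
```

With this setting the second row should keep "my ranking: oxygen, then water" intact in Phrase.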
You could add this to the end of the dplyr chain on d (after one more %>%) to total the number of characters per player:
  group_by(PlayerID) %>%
  summarize(Total = sum(Count))
#>   PlayerID Total
#>   <chr>    <int>
#> 1 A           69
#> 2 B           28
#> 3 C            5
#> 4 D           34
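The question also asks about words per player. A rough sketch of one way to do that, assuming a "word" is simply any run of non-whitespace characters; the phrases data frame below is a small hand-typed subset of the result above, and the names Words and TotalWords are only illustrative:

```r
library(dplyr)

# Hand-typed subset of the PlayerID / Phrase result, so the block runs on its own.
phrases <- data.frame(
  PlayerID = c("D", "B", "A", "D", "A"),
  Phrase   = c("hi", "hello!", "Hi!", "i think oxygen is most important", "I do too")
)

word_totals <- phrases %>%
  # Split each phrase on runs of whitespace and count the pieces.
  mutate(Words = lengths(strsplit(trimws(Phrase), "\\s+"))) %>%
  group_by(PlayerID) %>%
  summarize(TotalWords = sum(Words))

word_totals
```

Punctuation stays attached to the neighbouring word under this definition, so "hello!" counts as one word. If you need proper tokenisation, a package such as tidytext may be a better fit.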
data:
d <- structure(c(" *** D has joined the chat ***", " *** B has joined the chat ***",
" *** A has joined the chat ***", "D: hi", "B: hello!", "A: Hi!",
"D: i think oxygen is most important", "A: I do too", " *** C has joined the chat ***",
"B: agreed, that was my #1", "A: I didnt at first but then on second guess",
"A: oxygen then water", "C: hi hi"), .Dim = c(1L, 13L))
Created on 2022-05-25 by the reprex package (v2.0.1)
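Since the question mentions many groups, here is a hedged sketch of how the same idea might extend to several rows at once. It assumes each group is one row of a character matrix with one message per column, which may not match your real data; d2, Group, Slot and Text are made-up names for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical two-group data: one row per group, one column per message.
d2 <- rbind(
  c("D: hi", "B: hello!", " *** C has joined the chat ***"),
  c("A: ready?", "C: yes", "A: ok")
)

by_group <- as.data.frame(d2) %>%
  mutate(Group = row_number()) %>%                     # remember which row each message came from
  pivot_longer(-Group, names_to = "Slot", values_to = "Text") %>%  # one message per row
  filter(!grepl("***", Text, fixed = TRUE)) %>%        # drop the join notices
  separate(Text, into = c("PlayerID", "Phrase"), sep = ": ", extra = "merge") %>%
  select(Group, PlayerID, Phrase)

by_group
```

From there, group_by(Group, PlayerID) followed by summarize() would give per-player totals within each group.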