I want to take a tibble that represents dialogue and turn it into a .txt that can be manually edited in a text editor and then returned to a tibble for processing.
The key challenge I've had is separating the blocks of text in a way that they can be re-imported to a similar format after editing while preserving the "Speaker" designation.
Speed is important as the volume of files and the length of each text segment are large.
Here's the input tibble:
tibble::tribble(
~word, ~speakerTag,
"been", 1L,
"going", 1L,
"on", 1L,
"and", 1L,
"what", 1L,
"your", 1L,
"goals", 1L,
"are.", 1L,
"Yeah,", 2L,
"so", 2L,
"so", 2L,
"John", 2L,
"has", 2L,
"15", 2L
)
Here's the desired output in a .txt:
###Speaker 1###
been going on and what your goals are.
###Speaker 2###
Yeah, so so John has 15
Here's the desired return after correcting errors manually:
tibble::tribble(
~word, ~speakerTag,
"been", 1L,
"going", 1L,
"on", 1L,
"and", 1L,
"what", 1L,
"your", 1L,
"goals", 1L,
"in", 1L,
"r", 1L,
"Yeah,", 2L,
"so", 2L,
"so", 2L,
"John", 2L,
"hates", 2L,
"50", 2L
)
One way would be to prepend a "Speaker" header, surrounded by newlines ("\n"), to the first word of each consecutive run of speakerTag.
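library(data.table)
library(dplyr)
library(tidyr)
# df is the input tibble from the question; setDT() converts it to a data.table in place.
# Within each run of consecutive rows from one speaker, prefix the first word with a header.
setDT(df)[, word := replace(word, 1, paste0("\n\nSpeaker",
first(speakerTag), '\n\n', first(word))), by = rleid(speakerTag)]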
We can write this to a text file using:
writeLines(paste(df$word, collapse = " "), 'Downloads/temp.txt')
It looks like this:
cat(paste(df$word, collapse = " "))
#Speaker1
#
#been going on and what your goals are.
#
#Speaker2
#
#Yeah, so so John has 15
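If the exact "###Speaker N###" layout from the question is wanted, a minimal sketch along the following lines may work. It collapses each run of consecutive rows from one speaker into a single line of text and interleaves a header line before it. The helper name to_transcript and the temporary output path are assumptions made for illustration, and the input is rebuilt here so the example runs on its own.

```r
library(data.table)

# Example input from the question, rebuilt so the sketch is self-contained
df <- data.frame(
  word = c("been", "going", "on", "and", "what", "your", "goals", "are.",
           "Yeah,", "so", "so", "John", "has", "15"),
  speakerTag = rep(1:2, times = c(8, 6))
)

# Hypothetical helper: one header line plus one text line per speaker turn
to_transcript <- function(d) {
  turns <- as.data.table(d)[
    , .(speakerTag = speakerTag[1], text = paste(word, collapse = " ")),
    by = .(turn = rleid(speakerTag))   # a new turn starts whenever the speaker changes
  ]
  # rbind() stacks headers over texts; as.vector() reads column-wise, interleaving them
  as.vector(rbind(sprintf("###Speaker %d###", turns$speakerTag), turns$text))
}

lines <- to_transcript(df)
writeLines(lines, file.path(tempdir(), "temp.txt"))  # assumed output location
```

Unlike the := approach, this leaves the word column of the original data untouched, which may matter if the same table is reused later.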
To read it back into R, we can do:
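# quote = "" and comment.char = "" keep apostrophes and "#" in the dialogue as literal text
read.table('Downloads/temp.txt', sep = "\t", col.names = 'word',
quote = "", comment.char = "", stringsAsFactors = FALSE) %>%
# rows alternate header/text: keep the header rows as the tag and blank out the text rows
mutate(SpeakerTag = replace(word, c(FALSE, TRUE), NA)) %>%
fill(SpeakerTag) %>%
# keep only the text rows, then split them into one word per row
slice(seq(2, n(), 2)) %>%
separate_rows(word, sep = "\\s") %>%
filter(word != '')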
# word SpeakerTag
#1 been Speaker1
#2 going Speaker1
#3 on Speaker1
#4 and Speaker1
#5 what Speaker1
#6 your Speaker1
#7 goals Speaker1
#8 are. Speaker1
#9 Yeah, Speaker2
#10 so Speaker2
#11 so Speaker2
#12 John Speaker2
#13 has Speaker2
#14 15 Speaker2
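The alternating-row logic above relies on every speaker's text staying on exactly one line, which manual editing can easily break. As a hedged alternative, the sketch below parses the "###Speaker N###" layout from the question with readLines-style input and tolerates text that spans several lines. The helper name from_transcript is hypothetical, and the edited lines are supplied inline rather than read from a file so the example is self-contained.

```r
# Hypothetical helper: parse "###Speaker N###" transcript lines into word / speakerTag
from_transcript <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]                       # drop blank lines
  is_header <- grepl("^###Speaker [0-9]+###$", lines)
  stopifnot(length(lines) > 0, is_header[1])          # the file must start with a header
  ids <- as.integer(sub("^###Speaker ([0-9]+)###$", "\\1", lines[is_header]))
  tag <- ids[cumsum(is_header)]                       # carry each header down to its text
  words <- strsplit(lines[!is_header], "[[:space:]]+")
  data.frame(
    word = unlist(words),
    speakerTag = rep(tag[!is_header], lengths(words)),
    stringsAsFactors = FALSE
  )
}

# Edited transcript; in practice this would come from readLines() on the edited file
edited <- c(
  "###Speaker 1###",
  "been going on and what your goals",
  "in r",
  "###Speaker 2###",
  "Yeah, so so John hates 50"
)
res <- from_transcript(edited)
```

The result is a plain data frame with an integer speakerTag column; tibble::as_tibble(res) converts it if a tibble is needed.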
Obviously, the "Speaker" prefix in the SpeakerTag column can be removed if it is not needed.
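For instance, the prefix could be stripped and the column converted back to the integer speakerTag from the question roughly as follows. This is a sketch on a two-row stand-in for the result above; the name res is an assumption.

```r
# Small stand-in for the data frame returned by the read-back step
res <- data.frame(
  word = c("been", "Yeah,"),
  SpeakerTag = c("Speaker1", "Speaker2"),
  stringsAsFactors = FALSE
)

# Drop the literal "Speaker" prefix and store the remainder as an integer
res$speakerTag <- as.integer(sub("^Speaker", "", res$SpeakerTag))
res$SpeakerTag <- NULL
```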