
Converting a dialogue tibble to .txt, and back again


I want to take a tibble that represents dialogue and turn it into a .txt that can be manually edited in a text editor and then returned to a tibble for processing.

The key challenge I've had is separating the blocks of text in a way that they can be re-imported to a similar format after editing while preserving the "Speaker" designation.

Speed is important as the volume of files and the length of each text segment are large.

Here's the input tibble:

tibble::tribble(
    ~word, ~speakerTag,
   "been",          1L,
  "going",          1L,
     "on",          1L,
    "and",          1L,
   "what",          1L,
   "your",          1L,
  "goals",          1L,
   "are.",          1L,
  "Yeah,",          2L,
     "so",          2L,
     "so",          2L,
   "John",          2L,
    "has",          2L,
     "15",          2L
  )

Here's the desired output in a .txt:

###Speaker 1###
been going on and what your goals are.
###Speaker 2###
Yeah, so so John has 15

Here's the desired return after correcting errors manually:

tibble::tribble(
    ~word, ~speakerTag,
   "been",          1L,
  "going",          1L,
     "on",          1L,
    "and",          1L,
   "what",          1L,
   "your",          1L,
  "goals",          1L,
   "in",            1L,
   "r",             1L,
  "Yeah,",          2L,
     "so",          2L,
     "so",          2L,
   "John",          2L,
    "hates",        2L,
     "50",          2L
  )

Solution

  • One way is to prepend a speaker header, surrounded by "\n", to the first word of each run of consecutive speakerTag values

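    library(data.table)
    library(dplyr)
    library(tidyr)
    
    # df is the input tibble from the question.
    # Within each run of consecutive speakerTag values (rleid), prefix the
    # first word with a "Speaker<n>" header surrounded by blank lines.
    setDT(df)[, word := replace(word, 1, paste0("\n\nSpeaker", 
                first(speakerTag), '\n\n', first(word))), rleid(speakerTag)]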
    

    We can write this to a text file using:

    writeLines(paste(df$word, collapse = " "), 'Downloads/temp.txt')
    

    The result looks like this:

    cat(paste(df$word, collapse = " "))
    
    #Speaker1
    #
    #been going on and what your goals are. 
    #
    #Speaker2
    #
    #Yeah, so so John has 15
    

    To read it back into R, we can do:

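    # quote = "" and comment.char = "" stop apostrophes and "#" in the edited
    # text from being treated as quoting or comment characters.
    # This assumes each speaker's text stays on a single line.
    read.table('Downloads/temp.txt', sep = "\t", col.names = 'word',
               quote = "", comment.char = "", stringsAsFactors = FALSE) %>%
        # odd rows hold the speaker header, even rows hold the text
        mutate(SpeakerTag = replace(word, c(FALSE, TRUE), NA)) %>%
        fill(SpeakerTag) %>%
        slice(seq(2, n(), 2)) %>%
        separate_rows(word, sep = "\\s") %>%
        filter(word != '')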
    
    #    word SpeakerTag
    #1   been   Speaker1
    #2  going   Speaker1
    #3     on   Speaker1
    #4    and   Speaker1
    #5   what   Speaker1
    #6   your   Speaker1
    #7  goals   Speaker1
    #8   are.   Speaker1
    #9  Yeah,   Speaker2
    #10    so   Speaker2
    #11    so   Speaker2
    #12  John   Speaker2
    #13   has   Speaker2
    #14    15   Speaker2
    

    If only the speaker number is needed, the "Speaker" prefix can be removed from the SpeakerTag column.
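If the file should use the exact `###Speaker N###` header lines from the question, and the tags should come back as integers, the same idea can be sketched in base R. This is a minimal sketch, not a tested replacement for the approach above: `write_dialogue` and `read_dialogue` are hypothetical helper names, and it assumes the edited file still begins with a header line and that headers keep the `###Speaker <number>###` form after manual editing.

```r
# Hypothetical helper: one header line plus one text line per run of
# consecutive speakerTag values.
write_dialogue <- function(df, path) {
  runs <- rle(df$speakerTag)                        # runs of the same speaker
  run_id <- rep(seq_along(runs$lengths), runs$lengths)
  text <- vapply(split(df$word, run_id), paste, character(1), collapse = " ")
  headers <- paste0("###Speaker ", runs$values, "###")
  # rbind + as.vector interleaves: header, text, header, text, ...
  writeLines(as.vector(rbind(headers, text)), path)
}

# Hypothetical helper: parse the edited file back into one row per word.
read_dialogue <- function(path) {
  lines <- trimws(readLines(path, warn = FALSE))
  lines <- lines[nzchar(lines)]                     # drop blank lines
  pattern <- "^###Speaker ([0-9]+)###$"
  is_header <- grepl(pattern, lines)
  tags <- as.integer(sub(pattern, "\\1", lines[is_header]))
  # Tag of the most recent header above each line (NA before the first header)
  current <- c(NA_integer_, tags)[cumsum(is_header) + 1L]
  body <- lines[!is_header]
  words <- strsplit(body, "[[:space:]]+")
  data.frame(
    word = unlist(words),
    speakerTag = rep(current[!is_header], lengths(words)),
    stringsAsFactors = FALSE
  )
}

# Small round trip on a shortened version of the question's data
df <- data.frame(
  word = c("been", "going", "on", "Yeah,", "so", "John"),
  speakerTag = c(1L, 1L, 1L, 2L, 2L, 2L),
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".txt")
write_dialogue(df, path)
round_trip <- read_dialogue(path)
```

Because every body line inherits the tag of the nearest header above it, a speaker's text can be wrapped over several lines in the editor without breaking the import. Both helpers are vectorised over lines rather than words, which should help with long transcripts, though that has not been benchmarked here. The result can be passed to `tibble::as_tibble()` if a tibble is needed.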