Combining fragmented sentences in an R dataframe


I have a dataframe in which whole sentences are, in some cases, split into fragments spread across multiple rows.

For example, head(mydataframe) returns

#  1 Do you have any idea what
#  2  they were arguing about?
#  3          Do--Do you speak
#  4                  English?
#  5                     yeah.
#  6            No, I'm sorry.

Assuming a sentence can be terminated by either

"." or "?" or "!" or "..."

are there any R library functions capable of outputting the following:

#  1 Do you have any idea what they were arguing about?
#  2          Do--Do you speak English?
#  3                     yeah.
#  4            No, I'm sorry.

Solution

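  • This should work for sentences ending with ".", "...", "?" or "!". The idea is to collapse all rows into a single string, then split that string at every position that is preceded by a terminator and followed by whitespace: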

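    # Join every fragment into one long string
    x <- paste0(foo$txt, collapse = " ")
    # Split at zero-width positions after . ? or ! and before whitespace
    trimws(unlist(strsplit(x, "(?<=[?.!])(?=\\s)", perl=TRUE)))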
    

    Credit to @AvinashRaj for the pointers on the lookbehind.

    Which gives:

    #[1] "Do you have any idea what they were arguing about?"
    #[2] "Do--Do you speak English?"                         
    #[3] "yeah..."                                           
    #[4] "No, I'm sorry." 
    

    Data

    I modified the toy dataset to include a case where a string ends with "..." (as requested by the OP):

    foo <- data.frame(num = 1:6,
                      txt = c("Do you have any idea what", "they were arguing about?",
                              "Do--Do you speak", "English?", "yeah...", "No, I'm sorry."), 
                      stringsAsFactors = FALSE)
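
Since the question asks for a dataframe rather than a character vector, the approach above can be wrapped in a small helper that returns one sentence per row. This is only a sketch: the function name combine_sentences and the renumbered num column are assumptions made for illustration, not part of the original answer. It also assumes that every terminator is followed by whitespace once the rows are joined.

```r
# Hypothetical helper (name chosen for illustration): collapse text fragments,
# split them at sentence boundaries, and return one sentence per row.
combine_sentences <- function(txt) {
  # Join all fragments into a single string separated by spaces
  x <- paste(txt, collapse = " ")
  # Split at zero-width positions preceded by . ? or ! and followed by whitespace
  parts <- strsplit(x, "(?<=[?.!])(?=\\s)", perl = TRUE)[[1]]
  # Each piece after the first starts with the separating space, so trim it
  sentences <- trimws(parts)
  # Assumption: rows are simply renumbered 1..n in the result
  data.frame(num = seq_along(sentences),
             txt = sentences,
             stringsAsFactors = FALSE)
}

# Same toy data as in the answer
foo <- data.frame(num = 1:6,
                  txt = c("Do you have any idea what", "they were arguing about?",
                          "Do--Do you speak", "English?", "yeah...", "No, I'm sorry."),
                  stringsAsFactors = FALSE)

result <- combine_sentences(foo$txt)
print(result)
```

Because the split only happens where a terminator is followed by whitespace, the run of dots in "yeah..." stays in one piece, and the comma in "No, I'm sorry." does not trigger a split. Abbreviations such as "Mr. Smith" would still be split, so real subtitle data may need extra rules.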