
How to separate a sentence into words


In R, I'm working with datasets of conversations. The data currently looks like this:

Mike, "Hello how are you"
Sally, "Good you"

I eventually want to create a word cloud from this data, so I need one row per speaker and word, like this:

Mike, Hello
Mike, how
Mike, are
Mike, you
Sally, Good
Sally, you

Solution

  • Perhaps something like this using reshape2::melt?

    # Sample data
    df <- read.csv(text =
        'Mike, "Hello how are you"
        Sally, "Good you"', header = FALSE)
    
    # Split each trimmed utterance into words on runs of whitespace
    lst <- strsplit(trimws(as.character(df[, 2])), "\\s+")
    # Name each list entry after its speaker
    names(lst) <- trimws(df[, 1])
    
    # Reshape into a long data.frame, then put the speaker column first
    library(reshape2)
    df.long <- melt(lst)[2:1]
    df.long
    #     L1 value
    #1  Mike Hello
    #2  Mike   how
    #3  Mike   are
    #4  Mike   you
    #5 Sally  Good
    #6 Sally   you
    

    Explanation: Trim leading/trailing whitespace from the entries in the second column (trimws), split each entry on whitespace with strsplit, and store the result in a list. Name the list entries after the first column, then reshape the list into a long data.frame using reshape2::melt.
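
If you would rather avoid the reshape2 dependency, the same steps can be sketched in base R. This is only a minimal sketch: the column names `speaker`, `text` and `word` are my own choices for illustration, and the sample data is built directly rather than read with read.csv.

```r
# Sample data built directly to keep the sketch self-contained
# (column names "speaker" and "text" are assumptions, not from the question)
df <- data.frame(
  speaker = c("Mike", "Sally"),
  text    = c("Hello how are you", "Good you"),
  stringsAsFactors = FALSE
)

# Split each trimmed utterance on runs of whitespace; gives a list of word vectors
words <- strsplit(trimws(df$text), "\\s+")

# Repeat each speaker once per word and flatten the list of words
df_long <- data.frame(
  speaker = rep(df$speaker, lengths(words)),
  word    = unlist(words),
  stringsAsFactors = FALSE
)
print(df_long)
```

Here rep() with lengths(words) plays the role of melt: it lines up each speaker with every word from that speaker's utterance.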

    I leave turning this into a comma-separated data.frame up to you...
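
For the comma-separated form, one possible sketch is below. It rebuilds the long data.frame by hand (with the `L1` and `value` column names that melt produces) so it runs on its own; the file name `words.csv` in the commented-out call is hypothetical.

```r
# Long-format result, rebuilt here so the sketch stands alone
df_long <- data.frame(
  L1    = c("Mike", "Mike", "Mike", "Mike", "Sally", "Sally"),
  value = c("Hello", "how", "are", "you", "Good", "you"),
  stringsAsFactors = FALSE
)

# One "speaker, word" string per row
lines <- paste(df_long$L1, df_long$value, sep = ", ")
writeLines(lines)

# Alternatively, write a comma-separated file without quotes, row names or header
# ("words.csv" is a hypothetical file name)
# write.table(df_long, "words.csv", sep = ",", quote = FALSE,
#             row.names = FALSE, col.names = FALSE)
```

paste() is enough if you only need the text lines; write.table is the better fit if the result should end up in a file.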