
How to split words in R while keeping contractions


I'm trying to turn a character vector novel.lower.mid into a list of single words. So far, this is the code I've used:

midnight.words.l <- strsplit(novel.lower.mid, "\\W")

This produces a list of all the words. However, it splits everything, including contractions. The word "can't" becomes "can" and "t". How do I make sure those words aren't separated, or that the function just ignores the apostrophe?
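For illustration, the behaviour can be reproduced on a small made-up vector. The name sample.text and its contents are assumptions, since the actual contents of novel.lower.mid are not shown:

```r
# sample.text is a hypothetical stand-in for novel.lower.mid
sample.text <- c("i can't go", "it's late")

# \\W matches every non-word character, and the apostrophe is one of them,
# so each contraction is cut in two
repro <- strsplit(sample.text, "\\W")
repro[[1]]
```

The first element comes back as "i", "can", "t", "go", which is the split described above.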


Solution

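  • Instead of splitting, we can extract the words directly by matching runs of alphanumeric characters and apostrophes:

    library(stringr)
    str_extract_all(novel.lower.mid, "\\b[[:alnum:]']+\\b")

    Or keep strsplit and use a negative lookahead so that it splits on any non-word character except the apostrophe:

    strsplit(novel.lower.mid, "(?!')\\W", perl = TRUE)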
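As a rough check of both ideas, here is a minimal sketch in base R on a made-up vector. The names sample.text, words.extract and words.split are hypothetical, and regmatches() with gregexpr() is swapped in for str_extract_all so the example does not need stringr:

```r
# sample.text is a hypothetical stand-in for novel.lower.mid
sample.text <- c("i can't go, she said", "it's late")

# Extraction: pull out runs of letters, digits and apostrophes
# (base R equivalent of the str_extract_all call, using PCRE)
words.extract <- regmatches(
  sample.text,
  gregexpr("\\b[[:alnum:]']+\\b", sample.text, perl = TRUE)
)

# Splitting: split on any non-word character that is not an apostrophe
words.split <- strsplit(sample.text, "(?!')\\W", perl = TRUE)

# Adjacent separators such as ", " leave empty strings behind; drop them
words.split <- lapply(words.split, function(w) w[nzchar(w)])

words.extract[[1]]
words.split[[1]]
```

On this input both approaches return "i", "can't", "go", "she", "said" for the first element. The split version needs the extra nzchar() step because each separator character is matched on its own, so punctuation followed by a space produces an empty string between them.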