r tm stringi

How to remove words not in caps in R?


I'm doing text analysis in R. Is there a way to remove all words that are not capitalized, using tm or stringi?

If I have something like this

Albert Einstein went to the store and saw his friend Nikola Tesla ... + 200 pages

to be converted into

Albert Einstein Nikola Tesla

Best regards


Solution

  • Just use grep and a regular expression:

    words <- 'Albert Einstein went to the store and saw his friend Nikola Tesla'
    
    # split to vector of individual words
    vec <- unlist(strsplit(words, ' '))
    # just the capitalized ones
    caps <- grep('^[A-Z]', vec, value = TRUE)
    # assemble back to a single string, if you want
    paste(caps, collapse=' ')
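
If the real text contains punctuation (for example "Tesla." at the end of a sentence), splitting on spaces keeps the punctuation attached to the word. A possible variation, sketched here in base R only, is to extract the capitalized words directly with `regmatches()` and `gregexpr()`. The function name `keep_capitalized` is made up for illustration and is not part of the original answer:

```r
# Hypothetical helper (name chosen for this sketch, not from the answer above).
keep_capitalized <- function(text) {
  # Match every run of letters that begins, at a word boundary, with an
  # uppercase letter. Trailing punctuation is not part of the match.
  hits <- regmatches(text, gregexpr("\\b[[:upper:]][[:alpha:]]*", text, perl = TRUE))
  # Return one string per input element, with the matches joined by spaces.
  vapply(hits, paste, character(1), collapse = " ")
}

keep_capitalized("Albert Einstein went to the store and saw his friend Nikola Tesla.")
# [1] "Albert Einstein Nikola Tesla"
```

Like the grep approach, this keeps any capitalized word, so ordinary words at the start of a sentence (such as "The") would survive as well.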