r text-mining corpus

R: How to delete words other than specific words in a corpus


In the corpus "tkn_pb", I would like to delete all words except for a few keywords I chose (e.g. "attack" and "gunman"). Is it possible to do this?



Solution

  • You can use which and grepl to subset your corpus:

    Data:

    sample_tokens <- c("word", "another","a", "new", "word token", "one", "more", "and", "another one")
    

    Remove all words except "a" and "and":

    sample_tokens[which(grepl("\\b(a|and)\\b", sample_tokens))]
    [1] "a"   "and"
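
    Note that grepl matches the pattern anywhere inside an element, so a multi-word token such as "a word" would also be kept. If only exact matches are wanted, a minimal sketch with %in% might look like the following (keywords and kept are names introduced here for illustration, not part of the original answer):

    ```r
    sample_tokens <- c("word", "another", "a", "new", "word token", "one", "more", "and", "another one")

    # Hypothetical name for the words to keep
    keywords <- c("a", "and")

    # Keep only elements that are exactly equal to one of the keywords
    kept <- sample_tokens[sample_tokens %in% keywords]
    print(kept)
    ```

    Compared with the regular expression, this avoids having to escape special characters in the keywords, at the cost of no longer matching keywords inside longer strings.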
    

    EDIT:

    If the corpus is a list, then this solution suggested by @John would work:

    Data:

    sample_tokens <- list(c("word", "another","a", "new", "word token", "one", "more", "and", "another one"),
                   c("yet", "a", "few", "more", "words"),
                   c("and", "so on"))
    
    lapply(sample_tokens, function(x) x[which(grepl("\\b(a|and)\\b", x))])
    [[1]]
    [1] "a"   "and"
    
    [[2]]
    [1] "a"
    
    [[3]]
    [1] "and"
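
    The same idea can be applied to the keywords from the question. This is only a sketch: the real structure of tkn_pb is not shown, so it assumes a plain list of character vectors, and tkn_sample below is made-up stand-in data.

    ```r
    # Made-up stand-in for the asker's corpus (assumed: a list of character vectors)
    tkn_sample <- list(
      c("the", "gunman", "opened", "fire"),
      c("police", "described", "the", "attack"),
      c("no", "keywords", "here")
    )

    # Keywords taken from the question
    keywords <- c("attack", "gunman")

    # For each document, keep only tokens that exactly match a keyword
    kept <- lapply(tkn_sample, function(x) x[x %in% keywords])
    str(kept)
    ```

    Documents without any keyword end up as empty character vectors; if those should be dropped as well, Filter(length, kept) is one way to do it.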