r, text, nlp, data-wrangling, quanteda

In R, chop off column after n words


I have a df with a text column, and a column with a wordcount value.

How can I delete the last n words of each text (where n is given in the 'wc' column) and save the result in a third column?

In other words, I need the "introductory" part of a bunch of texts, and I know when the intro ends, so I want to cut the text off at that point and save the intro in a new column.

df <- data.frame(text = c("this is a long text", "this is also a long text", "another long text"), wc = c('1', '2', '1'))

Desired output:

text                      wc  chopped_off_text
this is a long text       1   this is a long
this is also a long text  2   this is also a
another long text         1   another long

Solution

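  • You can use the word() function from the stringr package to extract a range of words from a sentence. str_count(text, "\\s") + 1 counts the whitespace characters and adds one, which equals the number of words as long as the words are separated by single spaces. Subtracting wc (converted to integer, because it is stored as character) gives the position of the last word to keep.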

    library(stringr)
    library(dplyr)
    
    df %>%
      mutate(chopped_off_text =
               # keep words 1 through (total word count - wc)
               word(text, 1, end = str_count(text, "\\s") + 1 - as.integer(wc)))
    
                          text wc chopped_off_text
    1      this is a long text  1   this is a long
    2 this is also a long text  2   this is also a
    3        another long text  1     another long
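
If the texts may contain repeated spaces, or wc may be greater than or equal to the number of words, a base R variant along the following lines might be more forgiving. This is only a sketch under those assumptions, not the method from the answer above: drop_last_words is a hypothetical helper name, and it splits on runs of whitespace instead of counting single whitespace characters.

```r
# Hypothetical helper (not part of stringr or the answer above):
# drop the last n whitespace-separated words from each string.
drop_last_words <- function(text, n) {
  n <- as.integer(n)                          # wc is stored as character
  # Trim the ends, then split on runs of whitespace
  words <- strsplit(trimws(text), "\\s+")
  mapply(function(w, k) {
    keep <- max(length(w) - k, 0L)            # never keep a negative number of words
    paste(head(w, keep), collapse = " ")
  }, words, n, USE.NAMES = FALSE)
}

df <- data.frame(
  text = c("this is a long text", "this is also a long text", "another long text"),
  wc = c("1", "2", "1"),
  stringsAsFactors = FALSE                    # keeps text and wc as character on R < 4.0
)

df$chopped_off_text <- drop_last_words(df$text, df$wc)
df
```

Note that this version rejoins the kept words with single spaces, so irregular spacing in the original text is not preserved, and it returns an empty string when wc is at least the number of words.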