r, text-mining

Text mining in R: delete first sentence of each document


I have several documents and want to remove the first sentence of each one. I have not found a solution so far.

Here is an example. The data is structured like this:

case_number text
1 Today is a good day. It is sunny.
2 Today is a bad day. It is rainy.

The result should look like this:

case_number text
1 It is sunny.
2 It is rainy.

Here is the example dataset:

case_number <- c(1, 2)

text <- c("Today is a good day. It is sunny.",
          "Today is a bad day. It is rainy.")

data <- data.frame(case_number, text)

Solution

  • If sentences may contain punctuation of their own (e.g. abbreviations or decimal numbers) and you are already using a text mining library, it makes sense to let that library handle sentence tokenization.

    With {tidytext} :

    library(dplyr)
    library(tidytext)
    
    # example with punctuation in the 1st sentence
    data <- data.frame(case_number = c(1, 2),
                       text = c("Today is a good day, above avg. for sure, by 5.1 points. It is sunny.",
                                "Today is a bad day. It is rainy."))
    # tokenize to sentences; tokens are lowercased by default,
    # pass to_lower = FALSE to keep the original case
    data %>% 
      unnest_sentences(s, text)
    #>   case_number                                                        s
    #> 1           1 today is a good day, above avg. for sure, by 5.1 points.
    #> 2           1                                             it is sunny.
    #> 3           2                                      today is a bad day.
    #> 4           2                                             it is rainy.
    
    # drop the 1st sentence of every case_number group
    # (the .by argument requires dplyr >= 1.1.0)
    data %>% 
      unnest_sentences(s, text) %>% 
      filter(row_number() > 1, .by = case_number)
    #>   case_number            s
    #> 1           1 it is sunny.
    #> 2           2 it is rainy.
    

    Created on 2023-08-10 with reprex v2.0.2
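
If the first sentence is known to contain no internal periods (no abbreviations or decimals), a plain regular expression in base R may be enough. The following is only a minimal sketch under that assumption, using the dataset from the question; it is not part of the {tidytext} approach above. The pattern treats the first `.`, `!` or `?` followed by whitespace as the end of the first sentence.

```r
# Example data from the question
data <- data.frame(
  case_number = c(1, 2),
  text = c("Today is a good day. It is sunny.",
           "Today is a bad day. It is rainy.")
)

# Lazily match everything up to the first ., ! or ? that is followed by
# whitespace and remove it; the rest of each document is left untouched.
# Assumption: the first sentence has no internal ". " (e.g. "avg. for sure").
data$text <- sub("^.*?[.!?]\\s+", "", data$text, perl = TRUE)

data$text
```

Unlike the tokenizer-based version, this keeps the original case and returns one row per document, however many sentences remain. It would cut too early on a sentence such as "above avg. for sure", and a document consisting of a single sentence is left unchanged because nothing follows its final period.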