r, text-analysis, tidytext

How do I parse out a specific section of text?


My goal is to pull a specific section out of a set of Word documents based on key words, and I'm having trouble parsing that section out of a larger data set of text files. The data set originally looked like this, with "title one" and "title two" marking the start and end of the text I'm interested in, and "unimportant words" standing for the part of each text file I'm not interested in:

**Text**           **Text File** 
title one           Text file 1
sentence one        Text file 1
sentence two        Text file 1
title two           Text file 1
unimportant words   Text file 1
title one           Text file 2
sentence one        Text file 2

Then I used as.character to convert the columns to character vectors and unnest_tokens to tidy the data:

library(dplyr)     # provides %>%
library(tidytext)  # provides unnest_tokens()
df <- data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
tidy_df <- df %>% unnest_tokens(word, Text, token = "words")

I would now like to look only at the sentences in my data set and exclude the unimportant words. "title one" and "title two" are the same in every text file, but the sentences between them differ. I tried the code below, but it does not work:

filtered_resume <- lapply(tidy_resume, (tidy_resume %>% select(Name) %>% filter(title:two)))

Solution

  • I'm not familiar with the tidytext package, so here's an alternative base R solution. It uses this expanded example data (creation code included at the bottom):

    > df
                    Text        File
    1          title one Text file 1
    2       sentence one Text file 1
    3       sentence two Text file 1
    4          title two Text file 1
    5  unimportant words Text file 1
    6          title one Text file 2
    7       sentence one Text file 2
    8       sentence two Text file 2
    9     sentence three Text file 2
    10         title two Text file 2
    11 unimportant words Text file 2
    

    Write a function that adds a column indicating whether each row should be kept or dropped, based on the value in the Text column. Details are in the comments:

    get_important_sentences <- function(df_) {
      # Create some variables for filtering. val starts
      # at 0 so rows before the first 'title one' are dropped
      val = 0
      keep = c()
    
      # For every text row
      for (x in df_$Text) {
        # Double val on every row: it grows past 1
        # after 'title one' and stays 0 after 'title two'
        val = val * 2
    
        # If the current text includes "title",
        # set val to 1 for 'title one', and to 0
        # for 'title two'
        if (grepl("title", x)) {
          val = ifelse(grepl("one", x), 1, 0)
        }
    
        # append val to keep each time
        keep = c(keep, val)
      }
    
      # keep is now a numeric vector- add it to
      # the data frame
      df_$keep = keep
    
      # Exclude any rows where 'keep' is 1 (for
      # 'title one') or 0 (for 'title two' or any
      # unimportant words). Also, drop the helper
      # 'keep' column from the result
      return(df_[df_$keep > 1, c("Text", "File")])
    }
    

    Then you can call that either on the whole data frame:

    > get_important_sentences(df)
                Text        File
    2   sentence one Text file 1
    3   sentence two Text file 1
    7   sentence one Text file 2
    8   sentence two Text file 2
    9 sentence three Text file 2
    

    Or on a per-file-source basis with lapply:

    > lapply(split(df, df$File), get_important_sentences)
    $`Text file 1`
              Text        File
    2 sentence one Text file 1
    3 sentence two Text file 1
    
    $`Text file 2`
                Text        File
    7   sentence one Text file 2
    8   sentence two Text file 2
    9 sentence three Text file 2
    

    Data:

    df <-
      data.frame(
        Text = c(
          "title one",
          "sentence one",
          "sentence two",
          "title two",
          "unimportant words",
          "title one",
          "sentence one",
          "sentence two",
          "sentence three",
          "title two",
          "unimportant words"
        ),
        File = c(rep("Text file 1", 5), rep("Text file 2", 6)),
        stringsAsFactors = FALSE
      )
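
    A loop-free variant of the same idea, as a minimal sketch: use cumsum to count how many start and end markers have been seen so far, and keep the rows where a section is still open. The helper name between_markers and its start/end arguments are hypothetical. It assumes the markers match the Text value exactly (the function above uses grepl instead) and that Text has not yet been split into words by unnest_tokens.

    ```r
    # Example data, same shape as the data frame above
    df <- data.frame(
      Text = c("title one", "sentence one", "sentence two", "title two",
               "unimportant words",
               "title one", "sentence one", "sentence two", "sentence three",
               "title two", "unimportant words"),
      File = c(rep("Text file 1", 5), rep("Text file 2", 6)),
      stringsAsFactors = FALSE
    )

    # Hypothetical helper: TRUE for rows strictly between a start marker
    # and the next end marker (assumes exact matches on the marker text)
    between_markers <- function(text, start = "title one", end = "title two") {
      is_start <- text == start
      is_end   <- text == end
      # Starts seen minus ends seen: 1 while a section is open, 0 otherwise
      depth <- cumsum(is_start) - cumsum(is_end)
      # Keep open-section rows, but not the start marker itself
      depth > 0 & !is_start
    }

    # Compute the flag separately per file, then put the pieces back in
    # the original row order so they line up with df
    keep <- unsplit(lapply(split(df$Text, df$File), between_markers), df$File)
    result <- df[keep, ]
    print(result)
    ```

    On this example data the result holds the same five sentence rows as get_important_sentences(df). Computing the flag per file is a design choice: a file that is missing its "title two" row cannot leak into the next file.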