My goal is to pull out a specific section from a set of Word documents based on key words. I'm having trouble parsing specific sections of text out of a larger data set of text files. The data set originally looked like this, with "title one" and "title two" marking the start and end of the text I am interested in, and unimportant words marking the part of each text file I am not interested in:
**Text**            **Text File**
title one           Text file 1
sentence one        Text file 1
sentence two        Text file 1
title two           Text file 1
unimportant words   Text file 1
title one           Text file 2
sentence one        Text file 2
Then I used as.character to convert the data to character vectors and unnest_tokens to tidy the data:
df <- data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
tidy_df <- df %>% unnest_tokens(word, Text, token = "words")
I would now like to keep only the sentences in my dataset and exclude the unimportant words. "title one" and "title two" are the same in every text file, but the sentences between them differ. I've tried the code below, but it does not work.
filtered_df <- lapply(tidy_df, (tidy_df %>% select(Name) %>% filter(title:two)))
I'm not familiar with the tidytext package, so here's an alternative base R solution, using this expanded example data (creation code included at the bottom):
> df
Text File
1 title one Text file 1
2 sentence one Text file 1
3 sentence two Text file 1
4 title two Text file 1
5 unimportant words Text file 1
6 title one Text file 2
7 sentence one Text file 2
8 sentence two Text file 2
9 sentence three Text file 2
10 title two Text file 2
11 unimportant words Text file 2
Write a function that adds a column indicating whether each row should be kept or dropped, based on the value in the Text column. Details in comments:
get_important_sentences <- function(df_) {
  # Create some variables for filtering
  val <- 1
  keep <- c()
  # For every text row
  for (x in df_$Text) {
    # Double the current val
    val <- val * 2
    # If the current text includes "title",
    # reset val: 1 for "title one", 0 for
    # "title two"
    if (grepl("title", x)) {
      val <- ifelse(grepl("one", x), 1, 0)
    }
    # Append val to keep each time
    keep <- c(keep, val)
  }
  # keep is now a numeric vector - add it to
  # the data frame
  df_$keep <- keep
  # Exclude any rows where keep is 1 (for
  # "title one") or 0 (for "title two" or any
  # unimportant words). Also drop the keep
  # column.
  return(df_[df_$keep > 1, c("Text", "File")])
}
Then you can call that either on the whole data frame:
> get_important_sentences(df)
Text File
2 sentence one Text file 1
3 sentence two Text file 1
7 sentence one Text file 2
8 sentence two Text file 2
9 sentence three Text file 2
Or on a per-file-source basis with lapply:
> lapply(split(df, df$File), get_important_sentences)
$`Text file 1`
Text File
2 sentence one Text file 1
3 sentence two Text file 1
$`Text file 2`
Text File
7 sentence one Text file 2
8 sentence two Text file 2
9 sentence three Text file 2
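If you prefer a vectorized version, a cumulative count of the start and end markers does the same job without an explicit loop: a row is kept when it has seen more "title one" markers than "title two" markers so far and is not itself a title row. A sketch, using the same df as in the Data section below:

```r
# Keep rows strictly between "title one" and "title two" markers.
# Works per file (or on the whole data frame, since the counts stay paired).
extract_sentences <- function(df_) {
  starts <- cumsum(grepl("title one", df_$Text))
  ends   <- cumsum(grepl("title two", df_$Text))
  df_[starts > ends & !grepl("title one", df_$Text), ]
}

result <- do.call(rbind, lapply(split(df, df$File), extract_sentences))
```

This returns the same five sentence rows as get_important_sentences above.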
Data:
df <-
data.frame(
Text = c(
"title one",
"sentence one",
"sentence two",
"title two",
"unimportant words",
"title one",
"sentence one",
"sentence two",
"sentence three",
"title two",
"unimportant words"
),
File = c(rep("Text file 1", 5), rep("Text file 2", 6)),
stringsAsFactors = FALSE
)