I have a corpus. From this corpus I would like to randomly extract paragraphs. However, the randomization exercise must be such that paragraphs with specific words cannot be sampled.
This is an example:
txt <- c("PAGE 1. A single sentence. Short sentence. Three word sentence. \n\n Quarentine is hard",
"PAGE 2. Very short! Shorter.\n\n quarantine is very very hard",
"Very long sentence, with three parts, separated by commas. PAGE 3.\n\n quarantine it's good tough to focus on paper.",
"Fiscal policy is a bad thing. \n\n SO is a great place where skilled people solve coding problems.",
"Fiscal policy is not as good as people may think",
"Economics is fun. \n\n I prefer Macro.")
corp <- corpus(txt, docvars = data.frame(serial = 1:6))
It is straightforward to do this without any constraint:
reshape <- corpus_reshape(corp, "paragraphs")
sample <- corpus_sample(reshape, 4)
# Result
[1] "Economics is fun." "Fiscal policy is not as good as people may think"
[3] "Fiscal policy is a bad thing." "Quarentine is hard"
As you can see, the randomization picked "paragraphs" that contain "fiscal policy". I would like the corpus to be sampled while excluding paragraphs/sentences where "fiscal policy" appears.
Should I delete the sentences containing these words from the original dataset before doing the sampling? How would you do it?
Please note that in the real dataset I will need to exclude sentences with more than just one or two keywords, so please suggest something that can easily be expanded to many words.
Thanks a lot!
If you want to exclude paragraphs/sentences that contain "fiscal policy", then you need to first reshape the text into paragraphs, then filter out the paragraphs that contain the exclusion phrase, and only then sample.
If you filter the texts before reshaping, you will also drop paragraphs that do not contain the phrase whenever they belong to an input text that also contains the filter phrase(s).
library("quanteda")
## Package version: 2.0.1
set.seed(10)
txt <- c(
"PAGE 1. A single sentence. Short sentence. Three word sentence. \n\n Quarentine is hard",
"PAGE 2. Very short! Shorter.\n\n quarantine is very very hard",
"Very long sentence, with three parts, separated by commas. PAGE 3.\n\n quarantine it's good tough to focus on paper.",
"Fiscal policy is a bad thing. \n\n SO is a great place where skilled people solve coding problems.",
"Fiscal policy is not as good as people may think",
"Economics is fun. \n\n I prefer Macro."
)
corp <- corpus(txt, docvars = data.frame(serial = 1:6)) %>%
corpus_reshape(to = "paragraphs")
tail(corp)
## Corpus consisting of 6 documents and 1 docvar.
## text3.2 :
## "quarantine it's good tough to focus on paper."
##
## text4.1 :
## "Fiscal policy is a bad thing."
##
## text4.2 :
## "SO is a great place where skilled people solve coding proble..."
##
## text5.1 :
## "Fiscal policy is not as good as people may think"
##
## text6.1 :
## "Economics is fun."
##
## text6.2 :
## "I prefer Macro."
Now we can subset based on the pattern match.
corp2 <- corpus_subset(corp, !grepl("fiscal policy", corp, ignore.case = TRUE))
tail(corp2)
## Corpus consisting of 6 documents and 1 docvar.
## text2.2 :
## "quarantine is very very hard"
##
## text3.1 :
## "Very long sentence, with three parts, separated by commas. ..."
##
## text3.2 :
## "quarantine it's good tough to focus on paper."
##
## text4.2 :
## "SO is a great place where skilled people solve coding proble..."
##
## text6.1 :
## "Economics is fun."
##
## text6.2 :
## "I prefer Macro."
corpus_sample(corp2, size = 4)
## Corpus consisting of 4 documents and 1 docvar.
## text6.2 :
## "I prefer Macro."
##
## text1.2 :
## "Quarentine is hard"
##
## text2.2 :
## "quarantine is very very hard"
##
## text3.2 :
## "quarantine it's good tough to focus on paper."
The paragraphs containing "fiscal policy" are gone.
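Since you mentioned needing to exclude many keywords, one way to extend this (a sketch, using a hypothetical `exclude` vector of phrases) is to collapse the keywords into a single alternation pattern for `grepl()`:

```r
library("quanteda")

# Hypothetical list of phrases to exclude; add as many as needed
exclude <- c("fiscal policy", "quarantine", "quarentine")

# Combine into one regex alternation; if any phrase may contain regex
# metacharacters, pre-process it with, e.g., stringr::str_escape()
pattern <- paste(exclude, collapse = "|")

corp3 <- corpus_subset(corp, !grepl(pattern, corp, ignore.case = TRUE))
corpus_sample(corp3, size = 2)
```

Adding a keyword is then just a matter of appending it to `exclude`, with no other change to the pipeline.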
Note that here I used grepl(), but an all-around superior replacement is stri_detect_fixed() from stringi (or the equivalent stringr wrapper str_detect()). These give you more control, letting you use the faster fixed-pattern matching while still controlling whether case is matched.
all.equal(
  grepl("fiscal policy", txt, ignore.case = TRUE),
  stringi::stri_detect_fixed(txt, "fiscal policy", case_insensitive = TRUE)
)
## [1] TRUE
all.equal(
  grepl("fiscal policy", txt, ignore.case = TRUE),
  stringr::str_detect(txt, stringr::fixed("fiscal policy", ignore_case = TRUE))
)
## [1] TRUE