In some cases, corpus_reshape mistakenly treats certain periods as sentence breaks. I have a corpus from the pharmaceutical industry, and in many cases "Dr." is mistakenly treated as a sentence break.
This post (Quanteda's corpus_reshape function: how not to break sentences after abbreviations (like "e.g.")) is similar but unfortunately does not solve the problem. Here is an example:
library("quanteda")
txt <- c(
d1 = "With us we have Dr. Smith. We are not sure... where we stand.",
d2 = "The U.S. is south of Canada."
)
corpus(txt) %>%
corpus_reshape(to = "sentences")
Corpus consisting of 4 documents.
d1.1 : "With us we have Dr."
d1.2 : "Smith."
d1.3 : "We are not sure... where we stand."
d2.1 : "The U.S. is south of Canada."
It works for only a few of the cases with "Dr.". I was wondering whether words to be excluded from sentence splitting can be added to the function, because I would like to avoid using an alternative function to break the text into sentences. Thanks!
Please use corpus_segment with pattern and valuetype = "regex". You can find an example here:
https://quanteda.io/reference/corpus_segment.html
You may also use the use_docvars option.
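As a rough sketch of this suggestion, the example below splits the question's text with corpus_segment and a regular expression. The boundary pattern is my own assumption, not something taken from the quanteda documentation: it treats a period as a sentence end only if it is not preceded by "Dr" and is followed by whitespace plus an uppercase letter, or by the end of the text. It does not handle "?" or "!", and it will likely need tuning for a real corpus.

```r
library("quanteda")

txt <- c(
  d1 = "With us we have Dr. Smith. We are not sure... where we stand.",
  d2 = "The U.S. is south of Canada."
)

# Assumed sentence boundary (illustrative only): a period that is not
# preceded by "Dr" and is followed either by whitespace plus an uppercase
# letter or by the end of the text.
boundary <- "(?<!Dr)\\.(?=\\s+\\p{Lu}|$)"

sentences <- corpus_segment(
  corpus(txt),
  pattern = boundary,
  valuetype = "regex",
  case_insensitive = FALSE,   # keep "Dr" and \p{Lu} case-sensitive
  extract_pattern = FALSE,    # leave the period in the segmented text
  pattern_position = "after"  # the delimiter ends each segment
)

print(sentences)
```

Other abbreviations could probably be covered by extending the lookbehind, for example (?<!Dr|Mr|Prof), but that variant is untested here.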