I'm using the excellent tidytext package to tokenize sentences in several paragraphs. For instance, I want to take the following paragraph:

"I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."

and tokenize it into its two sentences. However, when I use the default sentence tokenizer of tidytext, I get three sentences.
Code
library(dplyr)
library(tidytext)

df <- tibble(Example_Text = "I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise.")
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "sentences")
Result
# A tibble: 3 x 1
Sentence
<chr>
1 i am perfectly convinced by it that mr.
2 darcy has no defect.
3 he owns it himself without disguise.
What is a simple way to use tidytext to tokenize sentences without common abbreviations such as "Mr." or "Dr." being interpreted as sentence endings?
You can use a regex as the splitting condition, but there is no guarantee that it will cover all common honorifics:
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = "(?<!\\b\\p{L}r)\\.")
Result:
# A tibble: 2 x 1
Sentence
<chr>
1 i am perfectly convinced by it that mr. darcy has no defect
2 he owns it himself without disguise
You can of course always create your own list of common titles and build a regex from that list:
titles <- c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")
regex <- paste0("(?<!\\b(", paste(titles, collapse = "|"), "))\\.")
# > regex
# [1] "(?<!\\b(Mr|Dr|Mrs|Ms|Sr|Jr))\\."

unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = regex)
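If you want to sanity-check the pattern outside of tidytext, here is a sketch using base R's `strsplit()` with `perl = TRUE`, which also uses a PCRE engine that accepts alternation inside a lookbehind as long as each alternative has a fixed length:

```r
text <- "I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."
titles <- c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")
regex <- paste0("(?<!\\b(", paste(titles, collapse = "|"), "))\\.")

# Split on periods that are not preceded by one of the titles,
# then trim the leading whitespace left on each piece.
trimws(strsplit(text, regex, perl = TRUE)[[1]])
# [1] "I am perfectly convinced by it that Mr. Darcy has no defect"
# [2] "He owns it himself without disguise"
```

Note that, unlike `unnest_tokens()`, this does not lowercase the text, and the periods used as split points are consumed by the split.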