Tags: r, text, tidytext

Tokenizing sentences with unnest_tokens(), ignoring abbreviations


I'm using the excellent tidytext package to tokenize sentences in several paragraphs. For instance, I want to take the following paragraph:

"I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."

and tokenize it into the two sentences

  1. "I am perfectly convinced by it that Mr. Darcy has no defect."
  2. "He owns it himself without disguise."

However, when I use the default sentence tokenizer of tidytext I get three sentences.

Code

library(dplyr)     # provides data_frame()
library(tidytext)

df <- data_frame(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))


unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "sentences")

Result

# A tibble: 3 x 1
                              Sentence
                                <chr>
1 i am perfectly convinced by it that mr.
2                    darcy has no defect.
3    he owns it himself without disguise.

What is a simple way to use tidytext to tokenize sentences without common abbreviations such as "Mr." or "Dr." being interpreted as sentence endings?


Solution

  • You can use a regex as the splitting condition, though there is no guarantee that it will cover all common honorifics. The negative lookbehind below refuses to split on a period that immediately follows a two-letter word ending in "r" (Mr, Dr, Sr, Jr), but it would still split after, say, "Mrs.":

    unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
                  pattern = "(?<!\\b\\p{L}r)\\.")
    

    Result:

    # A tibble: 2 x 1
                                                         Sentence
                                                            <chr>
    1 i am perfectly convinced by it that mr. darcy has no defect
    2                         he owns it himself without disguise
    
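    To see what the lookbehind is doing outside of unnest_tokens(), you can try the same pattern directly with stringr (a quick illustration, not part of the original answer; stringr uses ICU regular expressions, which support this lookbehind):

    library(stringr)

    text <- "I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."

    # The negative lookbehind blocks the split when the period follows a
    # two-letter word ending in "r" (Mr, Dr, Sr, Jr); the periods after
    # "defect" and "disguise" still split (the final one leaves an empty
    # trailing piece).
    str_split(text, "(?<!\\b\\p{L}r)\\.")[[1]]
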

    You can of course always create your own list of common titles and build a regex based on that list:

    titles =  c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")
    regex = paste0("(?<!\\b(", paste(titles, collapse = "|"), "))\\.")
    # > regex
    # [1] "(?<!\\b(Mr|Dr|Mrs|Ms|Sr|Jr))\\."
    
    unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
                  pattern = regex)
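
    This should again give the two expected sentences. If your text can contain these titles in other capitalizations (all caps, or already lowercased), one option is a case-insensitive variant of the same pattern via an inline (?i) flag; a small sketch, assuming the titles vector from above:

    # Assumed tweak: (?i) makes the lookbehind case-insensitive, so "mr."
    # and "MRS." are also protected from splitting
    regex_ci <- paste0("(?i)(?<!\\b(", paste(titles, collapse = "|"), "))\\.")

    unnest_tokens(df, input = "Example_Text", output = "Sentence",
                  token = "regex", pattern = regex_ci)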