Tags: python, nlp, stanford-nlp, tokenize

What is Stanford CoreNLP's recipe for tokenization?


Whether you're using the Stanza or CoreNLP (now deprecated) Python wrappers, or the original Java implementation, the tokenization rules that Stanford CoreNLP follows are very hard for me to figure out from the code in the original codebases.

The implementation is very verbose and the tokenization approach is not really documented. Do they consider this proprietary? On their website, they say that "CoreNLP splits texts into tokens with an elaborate collection of rules, designed to follow UD 2.0 specifications."

I'm looking for where to find those rules and, ideally, to replace CoreNLP (a massive codebase!) with just a regex or something much simpler that mimics their tokenization strategy. Please assume in your responses that Stanford's tokenization approach is the goal: I am not looking for alternative tokenization solutions, but I also very much do not want to include and ship a codebase that depends on a massive Java library.

The answer should address the following behaviors (collected as test cases after this list):

  • Word hyphenation should be disabled: someone with a hyphenated last name should not be split (e.g., "Marie Illonig-Alberts" should tokenize as ["Marie", "Illonig-Alberts"]). Similarly, compound words like "well-intentioned" should not be split.
  • Plural possessive apostrophes should be split off (e.g., "all boys' shoes are red" to ["all", "boys", "'", "shoes", "are", "red"]).
  • Apostrophes for singular possession should be split (e.g., "my aunt's favorite" to ["my", "aunt", "'s", "favorite"]).
  • Mr./Mrs. should not be split into ["Mr", "."] / ["Mrs", "."].
  • Normal punctuation marks should be their own tokens (sentence-final periods, commas, quotes for direct quotes or to denote sarcasm, question marks, semicolons, colons, and dashes). Double dashes should not be separated (e.g., "--" is ["--"], NOT ["-", "-"]).
  • "Wouldn't" should tokenize to ["would", "n't"].
  • "and/or" should stay a single token.
  • Contractions should be split (e.g., "I'm" to ["I", "'m"]).
  • I also sometimes see weird tokens like "-LRB-" and "-RRB-" (which seem to correspond to POS tags), and I do not understand them.
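
For concreteness, here is the same wish list as (input, expected tokens) pairs in Python. The expected tokens follow the bullets above; the "Mr. Smith" and double-dash carrier phrases are mine:

    # Desired behavior from the list above, as (input, expected tokens) pairs:
    CASES = [
        ("Marie Illonig-Alberts",   ["Marie", "Illonig-Alberts"]),
        ("well-intentioned",        ["well-intentioned"]),
        ("all boys' shoes are red", ["all", "boys", "'", "shoes", "are", "red"]),
        ("my aunt's favorite",      ["my", "aunt", "'s", "favorite"]),
        ("Mr. Smith",               ["Mr.", "Smith"]),
        ("wait -- what?",           ["wait", "--", "what", "?"]),
        ("wouldn't",                ["would", "n't"]),
        ("and/or",                  ["and/or"]),
        ("I'm",                     ["I", "'m"]),
    ]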

Solution

  • Here are a few notes from one of the main authors of it. What you write in your answer is basically all correct, but there are many nuances. 😊

    • Yes, the CoreNLP tokenizer was written to follow the Linguistic Data Consortium (LDC)'s English tokenization, but there are actually two versions of it: the old treebank tokenization ("Penn Treebank 3", https://catalog.ldc.upenn.edu/LDC99T42, 20th century) and the new treebank tokenization ("OntoNotes", https://catalog.ldc.upenn.edu/LDC2013T19, 21st century). PTBTokenizer supports both by specifying options (see the sketch after these notes). The biggest difference is that the new tokenization splits on most hyphens (except common prefixes and suffixes), which does not seem to be what you want.
    • LDC tokenization, especially the old tokenization, left many things unspecified, and we made our own pragmatic decisions (emoji, URLs, etc.).
    • UD v2 tokenization, our current default, basically follows the new treebank tokenization, but it does not escape brackets (i.e., ( ) { } become -LRB- -RRB- -LCB- -RCB- in LDC tokenization), something you appear not to want.
    • The NLTK tokenizer roughly does old treebank tokenization, but it differs (i.e., is worse) on many details. That's why the CoreNLP tokenizer is complex! To mention just two, URLs get broken up and it doesn't handle double contractions:
    t.tokenize("Independent Living http://www.inlv.demon.nl/.")
    ['Independent', 'Living', 'http', ':', '//www.inlv.demon.nl/', '.']
    # versus CoreNLP output is: 
    # { "Independent", "Living", "http://www.inlv.demon.nl/", "." }
    t.tokenize("I'd've thought that they'd've liked it.")
    ["I'd", "'ve", 'thought', 'that', "they'd", "'ve", 'liked', 'it', '.']
    # versus CoreNLP output is: 
    # { "I", "'d", "'ve", "thought", "that", "they", "'d", "'ve", "liked", "it", "." }