
Stanford Parser models


Stanford CoreNLP contains several models for parsing English sentences.

  • englishSR
  • english_SD
  • english_UD (default for depparse annotator)
  • englishRNN
  • englishFactored
  • englishPCFG (default for parse annotator)
  • englishPCFG.caseless
  • wsjRNN
  • wsjFactored
  • wsjPCFG

There are some comparisons in the following papers:

I couldn't find a full description and comparison of all the models. Does one exist anywhere? If not, I think it would be worth creating.


Solution

  • I can't give a full list (maybe Chris will chime in?), but my understanding is that these models are:

    • englishSR: The shift-reduce model, trained on various standard treebanks plus some of Stanford's hand-annotated data. This is the fastest and most accurate constituency model we have, but the model file is huge and slow to load.

    • english_SD: The NN Dependency Parser model for Stanford Dependencies. Deprecated in favor of english_UD -- the Universal Dependencies model.

    • english_UD: The NN Dependency Parser model for Universal Dependencies. This is the fastest and most accurate way to get dependency trees, but it won't give you constituency parses.

    • englishRNN: The hybrid PCFG + Neural constituency parser model. More accurate than any of the constituency parsers other than the shift-reduce model, but also noticeably slower.

    • englishFactored: Not 100% sure what this is, but my impression is that in both accuracy and speed it sits between englishPCFG and englishRNN.

    • englishPCFG: A regular old PCFG model for constituency parsing. Fast to load, and faster than any of the constituency models other than the shift-reduce model, but its accuracy is somewhat mediocre by modern standards. Nonetheless, a good default.

    • englishPCFG.caseless: A caseless version of the PCFG model.

    I assume the wsj* models are there to reproduce numbers in papers (trained on the proper WSJ splits), but again I'm not 100% sure what they are.
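
    All of the constituency models above are selected the same way: point the parse annotator's parse.model property at a different serialized model. Below is a minimal sketch; the classpath locations are the usual ones inside the CoreNLP models jars (englishSR ships in a separate models jar), so double-check them against your distribution:

```java
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public class ConstituencyParseDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,parse");
    // englishPCFG is the default for the parse annotator; swap in the
    // shift-reduce model (or englishRNN / englishFactored) by changing
    // this property. Paths assume the standard models jars.
    props.setProperty("parse.model",
        "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
    // props.setProperty("parse.model",
    //     "edu/stanford/nlp/models/srparser/englishSR.ser.gz");

    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation doc = new Annotation("Stanford CoreNLP ships with several parser models.");
    pipeline.annotate(doc);

    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      // Constituency tree produced by the parse annotator.
      Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
      System.out.println(tree.pennString());
    }
  }
}
```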

    To help choose the right model based on speed, accuracy, and the base memory used by the model (speeds are relative to englishPCFG):

    • englishSR: fast (~10x), accurate, high memory

    • englishPCFG: medium (1x), okay accuracy, low memory

    • englishRNN: slow (~0.25x), accurate, low memory

    • english_UD: fastest (~100x), accurate, low memory, dependency parses only (see the sketch after this list)
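
    If dependency trees are all you need, the depparse annotator loads english_UD by default. A minimal sketch, assuming the standard model location inside the CoreNLP models jar (you can omit the depparse.model property entirely and get the same default):

```java
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public class DependencyParseDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,depparse");
    // english_UD is already the default; setting it explicitly only makes
    // the model choice visible. Path assumes the standard models jar.
    props.setProperty("depparse.model",
        "edu/stanford/nlp/models/parser/nndep/english_UD.gz");

    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation doc = new Annotation("The neural dependency parser is fast.");
    pipeline.annotate(doc);

    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      // Universal Dependencies graph; note there is no constituency tree here.
      SemanticGraph deps =
          sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
      System.out.println(deps.toList());
    }
  }
}
```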