
Fastest way to lemmatize sentences


So I'm currently building a classification pipeline, and at this point the CoreNLP lemmatizer appears to be a fairly significant bottleneck. I'm trying to figure out whether the way I'm lemmatizing is causing the slowdown or whether lemmatization is just slow in general.

Here's my current code:

    def singleStanfordSentenceToLemmas(sentence: String): Seq[String] = {
      val doc = new Annotation(sentence)
      pipeline.annotate(doc)
      val tokens = doc.get(classOf[TokensAnnotation]).toList
      tokens.foldLeft(Seq[String]()) { (a, b) =>
        val lemma = b.get(classOf[LemmaAnnotation])
        if (!(stopWords.contains(lemma.toLowerCase) || puncWords.contains(b.originalText())))
          a :+ lemma.toLowerCase
        else a
      }
    }

And here's the code that creates the pipeline:

  val props = new Properties()

  props.put("annotators", "tokenize, ssplit, pos, lemma")
  val pipeline = new StanfordCoreNLP(props)

My current theories are

a) The fact that I'm using a full-blown CoreNLP object carries a lot of overhead that slows everything down. Perhaps there is a more minimal class that ONLY lemmatizes?

b) The lemmatizer requires ssplit and POS tagging, which seems pretty heavyweight given that I'm only feeding it individual sentences. Is there a more efficient way to find the lemmas of individual words?

c) Perhaps CoreNLP is just slow, and there might be a faster lemmatizer out there.

Any help would be highly appreciated!


Solution

  • a) Yes, there is certainly overhead there. You can get rid of some of it, but CoreNLP seems (to me) rather inconsistent about separating the core pipeline wrappers from the underlying, more direct entities. But you can do:

    import edu.stanford.nlp.process.Morphology
    val morph = new Morphology()
    ...
    morph.stem(label)
    

    You will also need something like

    private lazy val POSTagger =
      new MaxentTagger("edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger")
    

    to tag parts of speech beforehand, but I think this puts you on the right track.
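    To make this concrete, here's a minimal sketch that puts the two together: tag a sentence with `MaxentTagger`, then feed each word/tag pair to `Morphology`. The wrapper object `LightLemmatizer` and its `lemmas` method are just illustrative names, and this assumes the tagger model file is on your classpath:

    ```scala
    import edu.stanford.nlp.ling.Sentence
    import edu.stanford.nlp.process.Morphology
    import edu.stanford.nlp.tagger.maxent.MaxentTagger
    import scala.collection.JavaConverters._

    object LightLemmatizer {
      // Loading the tagger model is expensive; do it once and reuse it.
      private lazy val tagger = new MaxentTagger(
        "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger")
      private val morph = new Morphology()

      def lemmas(sentence: String): Seq[String] = {
        // Naive whitespace tokenization; the full pipeline's tokenizer is smarter.
        val words  = Sentence.toWordList(sentence.split("\\s+"): _*)
        val tagged = tagger.tagSentence(words).asScala
        // Morphology.lemma uses the POS tag to pick the right analysis
        // (e.g. "saw" as a noun vs. as the past tense of "see").
        tagged.map(tw => morph.lemma(tw.word, tw.tag)).toSeq
      }
    }
    ```

    Note the sketch uses `Morphology.lemma(word, tag)` rather than `stem`, since `stem` doesn't take the POS tag into account.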

    b) You won't get rid of all of this easily. CoreLabel is the main data structure in CoreNLP and is used to add more and more data to the same elements. So lemmatization will add lemmas to the same structure; POS tagging will be used by the lemmatizer to differentiate between nouns, verbs, etc., and it will pick up POS tags from there too.

    c) Yes, this is the case too. How to deal with it varies a lot with your intent and context. I, for instance, use CoreNLP inside Spark to leverage the full power of a distributed cluster, and I also pre-compute and store some of this data. I hope this gives you some insight.
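    For the Spark route, the key is to build the pipeline once per partition rather than once per sentence, since model loading dominates the cost. A sketch under that assumption (Spark and the CoreNLP models on the classpath; `lemmatizePartitions` is just an illustrative name):

    ```scala
    import java.util.Properties
    import org.apache.spark.rdd.RDD
    import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
    import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
    import scala.collection.JavaConverters._

    def lemmatizePartitions(sentences: RDD[String]): RDD[Seq[String]] =
      sentences.mapPartitions { iter =>
        // One pipeline per partition: the annotator models are loaded
        // once per task instead of once per sentence.
        val props = new Properties()
        props.put("annotators", "tokenize, ssplit, pos, lemma")
        val pipeline = new StanfordCoreNLP(props)
        iter.map { s =>
          val doc = new Annotation(s)
          pipeline.annotate(doc)
          doc.get(classOf[TokensAnnotation]).asScala.map(_.lemma).toSeq
        }
      }
    ```

    Because `StanfordCoreNLP` isn't serializable, constructing it inside `mapPartitions` (on the executor) also sidesteps the usual task-serialization errors you'd hit by building it on the driver.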