scala machine-learning apache-spark stanford-nlp

CoreNLP significantly slowing down Spark job

I'm attempting to build a Spark job that classifies documents by splitting each one into sentences and then lemmatizing every word in each sentence for logistic regression. However, I'm finding that Stanford's Annotation class is causing a serious bottleneck in my Spark job: it takes 20 minutes to process only 500k documents.

Here is the code I'm currently using for sentence parsing and classification:

Sentence parsing:

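import scala.collection.JavaConverters._
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.util.CoreMap

// pipeHolder is defined elsewhere and holds the shared annotation pipeline
def prepSentences(text: String): List[CoreMap] = {
    val mod = text.replace("Sr.", "Sr") // deals with an edge case
    val doc = new Annotation(mod)
    pipeHolder.get.annotate(doc)
    // CoreNLP returns a java.util.List, so convert it to a Scala List
    doc.get(classOf[SentencesAnnotation]).asScala.toList
}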

I then take each CoreMap and extract the lemmas as follows:

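import scala.collection.JavaConverters._
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation

// stopWords and puncWords are sets of strings defined elsewhere
def coreMapToLemmas(map: CoreMap): Seq[String] = {
    map.get(classOf[TokensAnnotation]).asScala
        // keep a token only if it is neither a stop word nor punctuation
        .filterNot(t => stopWords.contains(t.lemma().toLowerCase) ||
                        puncWords.contains(t.originalText()))
        .map(_.lemma().toLowerCase)
        .toList
}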

Is there perhaps a class that performs only part of this processing?

Solution

Try using CoreNLP's Shift Reduce parser implementation.

A basic example (typing this without a compiler):

import java.util.Properties
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

val p = new Properties()
p.setProperty("annotators", "tokenize ssplit pos parse lemma sentiment")
// use Shift-Reduce Parser with beam search
// http://nlp.stanford.edu/software/srparser.shtml
p.setProperty("parse.model", "edu/stanford/nlp/models/srparser/englishSR.beam.ser.gz")
val corenlp = new StanfordCoreNLP(p)

val text = "text to annotate"
val annotation = new Annotation(text)
corenlp.annotate(annotation) // annotates in place; read results from `annotation`

I work on a production system that uses CoreNLP in a Spark processing pipeline. Switching to the Shift Reduce parser with beam search improved the parsing speed of my pipeline by a factor of 16 and reduced the amount of working memory required for parsing. The Shift Reduce parser's runtime is linear in sentence length, which scales better than the standard lexicalized PCFG parser.
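If the job only needs sentences and lemmas, as in the question, it may be possible to skip parsing altogether: CoreNLP's lemma annotator depends on tokenize, ssplit and pos, not on parse. A minimal sketch of such a configuration, assuming nothing downstream needs parse trees or sentiment (the helper name lemmaOnlyProps is made up for illustration):

```scala
import java.util.Properties

// Hypothetical helper: properties for a pipeline that stops at lemmatization.
// Assumption: parse trees and sentiment are not needed by later stages.
def lemmaOnlyProps(): Properties = {
  val props = new Properties()
  // lemma requires tokenize, ssplit and pos; parse is deliberately left out
  props.setProperty("annotators", "tokenize,ssplit,pos,lemma")
  props
}
```

The result would be passed to the pipeline constructor in the same way as `p` above, i.e. `new StanfordCoreNLP(lemmaOnlyProps())`.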

To use the Shift Reduce parser, you'll need the Shift Reduce models jar, which is a separate download from CoreNLP's website. Put it on your classpath.
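Independently of the parser choice, it is worth checking that the pipeline is built once per executor JVM rather than once per document, because loading the models is expensive. The question's pipeHolder presumably does something along these lines already. Below is a rough sketch of one way to write such a holder; LazyHolder is a hypothetical name, not a CoreNLP or Spark class, and the sketch uses a generic type so that it stands alone:

```scala
// Hypothetical helper: builds an expensive object at most once per JVM
// (in Spark, once per executor), the first time `get` is called.
class LazyHolder[T](build: () => T) extends Serializable {
  // @transient: the built value is not serialized along with the holder;
  // lazy: each JVM constructs its own copy on first access.
  @transient private lazy val instance: T = build()

  def get: T = instance
}
```

With CoreNLP on the classpath this could be used as `val pipeHolder = new LazyHolder(() => new StanfordCoreNLP(p))`, calling `pipeHolder.get` inside the function passed to `map` or `mapPartitions`.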