Tags: nlp, stanford-nlp, uima, dkpro-core

Reusable version of DKPro Core pipeline


I have set up DKPro Core as a web service to take an input and provide a tokenised output. The service itself is set up as a Jersey resource:

@Path("/")
public class MyResource
{

  public MyResource()
  {
    // Nothing here
  }

  @GET
  public String generate(@QueryParam("q") final String input)
  {
    try
    {
      final JCasIterable en = iteratePipeline(
        createReaderDescription(StringReader.class, StringReader.PARAM_DOCUMENT_TEXT, input, StringReader.PARAM_LANGUAGE, "en")
       ,createEngineDescription(StanfordSegmenter.class)
       ,createEngineDescription(StanfordPosTagger.class)
       ,createEngineDescription(StanfordParser.class)
       ,createEngineDescription(StanfordNamedEntityRecognizer.class)
      );

      final StringBuilder sb = new StringBuilder();
      for (final JCas jCas : en)
      {
        for (final Token token : select(jCas, Token.class))
        {
          sb.append('[');
          sb.append(token.getCoveredText());
          sb.append(' ');
          sb.append(token.getPos().getPosValue());
          sb.append(']');
        }
      }
      return sb.toString();
    }
    catch (final Exception e)
    {
      throw new RuntimeException("Problem", e);
    }
  }
}

Everything works but it is very slow, taking 7-10 seconds for each input. I assume that this is because the pipeline is being recreated for each request.

How can this code be reworked to move the pipeline creation to the constructor and reduce the load for individual requests? Note that there could be multiple simultaneous requests so anything that isn't thread-safe will need to be inside the request.


Solution

    Create a single CAS:

    JCas jcas = JCasFactory.createJCas();
    

    Fill the CAS:

    jcas.setDocumentText("This is a test");
    jcas.setDocumentLanguage("en");
    

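    Create the pipeline once and keep the engine around for further requests:

    AnalysisEngine engine = createEngine(
       // Wrap the component descriptions in one aggregate description;
       // createEngine takes a single description.
       createEngineDescription(
          createEngineDescription(...),
          createEngineDescription(...),
          ...));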
    

    If the engine is created implicitly for every request, as happens with iteratePipeline, it has to load its models over and over again.

    Apply the pipeline to the CAS:

    SimplePipeline.runPipeline(jcas, engine);
    

    To speed up processing further, keep a pool of CASes and reuse them across requests, since creating a CAS from scratch takes a moment. Call jcas.reset() on a CAS before filling it for the next request.

    Some components may be thread-safe, others may not; this is largely up to the implementation of the underlying third-party library. The wrappers in DKPro Core are also not explicitly built to be thread-safe. For example, in the default configuration, models are loaded and used depending on the document language, so using the same analysis engine instance from multiple threads would cause problems.

    For the same reason, consider creating a pool of pre-instantiated pipelines. This needs quite a bit of memory, though, because each instance loads its own models. There is some experimental functionality to share models between instances of the same component, but it has not been tested much. Mind that third-party tools may also have implemented their models in a non-thread-safe manner. For model sharing in DKPro Core, see this discussion on the mailing list.

    Disclosure: I am one of the DKPro Core developers.
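
The pooling idea mentioned above could be sketched roughly as follows. This is only an illustration in plain Java, not part of DKPro Core or uimaFIT: EnginePool and withEngine are hypothetical names, and a StringBuilder stands in for a real pipeline instance so that the example runs without any UIMA dependencies. All instances are built once in the constructor, and each request borrows one instance exclusively, so a non-thread-safe engine is never shared between two threads at the same time.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

/** Fixed-size pool of expensive, non-thread-safe objects (hypothetical helper). */
final class EnginePool<E> {
    private final BlockingQueue<E> idle;

    /** Builds all instances up front, so the creation cost is paid once at startup. */
    EnginePool(int size, Supplier<E> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    /** Lends one instance to the calling thread and always puts it back afterwards. */
    <R> R withEngine(Function<E, R> task) {
        final E engine;
        try {
            engine = idle.take(); // blocks while every instance is in use
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for an engine", e);
        }
        try {
            return task.apply(engine);
        } finally {
            idle.add(engine); // cannot fail: capacity equals the number of instances
        }
    }

    public static void main(String[] args) {
        // StringBuilder is only a stand-in for a real pipeline instance.
        EnginePool<StringBuilder> pool = new EnginePool<>(2, StringBuilder::new);
        String out = pool.withEngine(sb -> {
            sb.setLength(0); // clear state left over from the previous borrower
            return sb.append("[This DT]").toString();
        });
        System.out.println(out);
    }
}
```

In a resource class, the pool would be created once (for example in a static field or a singleton), with a factory that builds the AnalysisEngine, and the request method would run the pipeline inside withEngine. The pool size trades memory for concurrency, since every instance holds its own models.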