Tags: java, apache-spark, nlp, johnsnowlabs-spark-nlp

SparkNLP PipelineModel which includes AnnotatorApproach in stages


In a SparkNLP PipelineModel, all the stages have to be of type AnnotatorModel. But what if one of those AnnotatorModels requires a certain column in the dataset as input, and this input column is the output of an AnnotatorApproach?

For instance, I have a trained NER model (the last stage of the pipeline) which requires tokens and POS tags as two of its inputs. The tokens are also required by the POS tagger. But the Tokenizer is an AnnotatorApproach, so I am not able to add it to the PipelineModel.

This is how the Tokenizer is instantiated (in Java):

AnnotatorApproach<TokenizerModel> tokenizer = new Tokenizer();

This works:

Pipeline pipeline = new Pipeline().setStages( new PipelineStage[]{tokenizer} );

But this doesn't work, because Tokenizer is not a Transformer:

List<Transformer> list = new ArrayList<>();
list.add(tokenizer);
PipelineModel pipelineModel = new PipelineModel("ID42", list);

Solution

  • Fitting the Pipeline always returns a PipelineModel that is ready for inference, even when you fit on an empty dataset. That is fine as long as you only depend on annotators that don't require training: put the Tokenizer and the pretrained models into a Pipeline as stages and call fit. This is the recommended usage; manipulating the individual stages is typically unnecessary, hacky, and can lead to errors.
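Why fitting on an empty dataset is enough can be sketched with a small, library-free model of the Estimator/Transformer contract that a pipeline's fit step follows. This is only an illustration under stated assumptions: every name below (`PipelineFitSketch`, `Stage`, `TokenizerApproach`, `PretrainedTaggerModel`) is a hypothetical stand-in, not a Spark or Spark NLP class, and the whitespace tokenizer and digit-based tagger are toy logic. In real code the equivalent step would be calling `fit` on the `Pipeline` from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Library-free model of the fit contract. All names are hypothetical stand-ins.
class PipelineFitSketch {

    /** Anything that may appear in a pipeline (stands in for PipelineStage). */
    interface Stage { String name(); }

    /** A stage that can already process data (stands in for an AnnotatorModel). */
    interface Transformer extends Stage {
        List<String> transform(List<String> rows);
    }

    /** A stage that must be fitted first (stands in for an AnnotatorApproach). */
    interface Estimator extends Stage {
        Transformer fit(List<String> trainingRows);
    }

    /** Rule-based tokenizer: fit() never looks at the data, so empty input is fine. */
    static final class TokenizerApproach implements Estimator {
        public String name() { return "tokenizer"; }
        public Transformer fit(List<String> trainingRows) {
            return new Transformer() {
                public String name() { return "tokenizerModel"; }
                public List<String> transform(List<String> rows) {
                    List<String> tokens = new ArrayList<>();
                    for (String row : rows) {
                        for (String token : row.trim().split("\\s+")) {
                            if (!token.isEmpty()) tokens.add(token);
                        }
                    }
                    return tokens;
                }
            };
        }
    }

    /** Already-trained stage that consumes tokens (toy stand-in for a POS or NER model). */
    static final class PretrainedTaggerModel implements Transformer {
        public String name() { return "taggerModel"; }
        public List<String> transform(List<String> tokens) {
            List<String> tagged = new ArrayList<>();
            for (String token : tokens) {
                tagged.add(token + (token.matches("\\d+") ? "/NUM" : "/WORD"));
            }
            return tagged;
        }
    }

    /** Fits estimators in order and keeps transformers as they are. */
    static List<Transformer> fit(List<Stage> stages, List<String> trainingRows) {
        List<Transformer> fitted = new ArrayList<>();
        List<String> current = trainingRows;
        for (Stage stage : stages) {
            Transformer model;
            if (stage instanceof Estimator) {
                model = ((Estimator) stage).fit(current);
            } else if (stage instanceof Transformer) {
                model = (Transformer) stage;
            } else {
                throw new IllegalArgumentException("Unsupported stage: " + stage.name());
            }
            fitted.add(model);
            // Later stages see the output of earlier ones, also while fitting.
            current = model.transform(current);
        }
        return fitted;
    }

    /** Runs the fitted stages in order over new data. */
    static List<String> transform(List<Transformer> model, List<String> rows) {
        List<String> current = rows;
        for (Transformer stage : model) {
            current = stage.transform(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<Stage> stages =
                Arrays.<Stage>asList(new TokenizerApproach(), new PretrainedTaggerModel());
        // Fit on an empty dataset: no stage here needs training data.
        List<Transformer> model = fit(stages, Collections.<String>emptyList());
        // Prints [Spark/WORD, NLP/WORD, 42/NUM]
        System.out.println(transform(model, Arrays.asList("Spark NLP 42")));
    }
}
```

The point of the sketch is that the caller never builds the list of transformers by hand. The fit step turns each approach into its model and passes existing models through unchanged, which is why an empty dataset suffices when no stage actually learns from data.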