Tags: scala, apache-spark, apache-spark-mllib, johnsnowlabs-spark-nlp

Mix Spark MLlib and Spark NLP in a pipeline


In an MLlib pipeline, how can I chain a CountVectorizer (from Spark ML) after a Stemmer (from Spark NLP)?

When I try to use both in a pipeline I get:

myColName must be of type equal to one of the following types: [array<string>, array<string>] but was actually of type array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>.
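For reference, this is the kind of wiring that triggers the error: pointing the Spark ML stage directly at an annotator's output column (a minimal sketch; the "stem" column name is an assumption):

      // Fails: "stem" holds Spark NLP annotation structs (array<struct<...>>),
      // not the array<string> that CountVectorizer expects
      val cv = new CountVectorizer()
        .setInputCol("stem")
        .setOutputCol("features")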



Solution

  • You need to add a Finisher to your Spark NLP pipeline. Spark NLP annotators output annotation structs (the array<struct<annotatorType,begin,end,result,metadata,embeddings>> type in the error), while Spark ML transformers such as CountVectorizer expect a plain array<string>; the Finisher performs exactly that conversion. Try this:

      import com.johnsnowlabs.nlp.{DocumentAssembler, Finisher}
      import com.johnsnowlabs.nlp.annotator.{SentenceDetector, Stemmer, Tokenizer}
      import org.apache.spark.ml.Pipeline
      import org.apache.spark.ml.feature.CountVectorizer
      import spark.implicits._ // assumes a SparkSession named `spark` in scope, as in spark-shell

      // Spark NLP stages: raw text -> document -> sentences -> tokens -> stems
      val documentAssembler =
        new DocumentAssembler().setInputCol("text").setOutputCol("document")
      val sentenceDetector =
        new SentenceDetector().setInputCols("document").setOutputCol("sentences")
      val tokenizer =
        new Tokenizer().setInputCols("sentences").setOutputCol("token")
      val stemmer = new Stemmer()
        .setInputCols("token")
        .setOutputCol("stem")

      // The Finisher turns the annotation structs into a plain array<string>
      // column that Spark ML transformers can consume
      val finisher = new Finisher()
        .setInputCols("stem")
        .setOutputCols("token_features")
        .setOutputAsArray(true)
        .setCleanAnnotations(false)

      // Spark ML stage, now reading the finished array<string> column
      val cv = new CountVectorizer()
        .setInputCol("token_features")
        .setOutputCol("features")

      val pipeline = new Pipeline()
        .setStages(
          Array(
            documentAssembler,
            sentenceDetector,
            tokenizer,
            stemmer,
            finisher,
            cv
          ))

      val data =
        Seq("Peter Pipers employees are picking pecks of pickled peppers.")
          .toDF("text")

      val model = pipeline.fit(data)
      val df = model.transform(data)
    

    output:

    +--------------------------------------------------------------------+
    |features                                                            |
    +--------------------------------------------------------------------+
    |(10,[0,1,2,3,4,5,6,7,8,9],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
    +--------------------------------------------------------------------+
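    Because setCleanAnnotations(false) keeps the intermediate annotation columns, you can verify the handoff on the df from above: token_features should appear in the schema as array<string>, while stem remains an annotation struct. For example:

      df.printSchema()                        // token_features: array<string>
      df.select("token_features").show(false) // the stemmed tokens as plain strings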