Tags: scala, apache-spark, apache-spark-mllib, johnsnowlabs-spark-nlp

Mix Spark MLlib and Spark NLP in a pipeline


In an MLlib pipeline, how can I chain a CountVectorizer (from Spark ML) after a Stemmer (from Spark NLP)?

When I try to use both in a pipeline I get:

myColName must be of type equal to one of the following types: [array<string>, array<string>] but was actually of type array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>.
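For reference, this is the kind of wiring that triggers the error: pointing the Spark ML stage directly at an annotator's output column (a minimal sketch; the "stem" column name is an assumption):

      // Fails: "stem" holds Spark NLP annotation structs (array<struct<...>>),
      // not the array<string> that CountVectorizer expects
      val cv = new CountVectorizer()
        .setInputCol("stem")
        .setOutputCol("features")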



Solution

  • You need to add a Finisher to your Spark NLP pipeline. Spark NLP annotators output annotation structs (the array<struct<annotatorType,begin,end,result,metadata,embeddings>> type in the error), while Spark ML transformers such as CountVectorizer expect a plain array<string>; the Finisher performs exactly that conversion. Try this:

      import com.johnsnowlabs.nlp.{DocumentAssembler, Finisher}
      import com.johnsnowlabs.nlp.annotator.{SentenceDetector, Stemmer, Tokenizer}
      import org.apache.spark.ml.Pipeline
      import org.apache.spark.ml.feature.CountVectorizer
      import spark.implicits._ // assumes a SparkSession named `spark` in scope, as in spark-shell

      // Spark NLP stages: raw text -> document -> sentences -> tokens -> stems
      val documentAssembler =
        new DocumentAssembler().setInputCol("text").setOutputCol("document")
      val sentenceDetector =
        new SentenceDetector().setInputCols("document").setOutputCol("sentences")
      val tokenizer =
        new Tokenizer().setInputCols("sentences").setOutputCol("token")
      val stemmer = new Stemmer()
        .setInputCols("token")
        .setOutputCol("stem")

      // The Finisher turns the annotation structs into a plain array<string>
      // column that Spark ML transformers can consume
      val finisher = new Finisher()
        .setInputCols("stem")
        .setOutputCols("token_features")
        .setOutputAsArray(true)
        .setCleanAnnotations(false)

      // Spark ML stage, now reading the finished array<string> column
      val cv = new CountVectorizer()
        .setInputCol("token_features")
        .setOutputCol("features")

      val pipeline = new Pipeline()
        .setStages(
          Array(
            documentAssembler,
            sentenceDetector,
            tokenizer,
            stemmer,
            finisher,
            cv
          ))

      val data =
        Seq("Peter Pipers employees are picking pecks of pickled peppers.")
          .toDF("text")

      val model = pipeline.fit(data)
      val df = model.transform(data)
    

    output:

    +--------------------------------------------------------------------+
    |features                                                            |
    +--------------------------------------------------------------------+
    |(10,[0,1,2,3,4,5,6,7,8,9],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
    +--------------------------------------------------------------------+
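    Because setCleanAnnotations(false) keeps the intermediate annotation columns, you can verify the handoff on the df from above: token_features should appear in the schema as array<string>, while stem remains an annotation struct. For example:

      df.printSchema()                        // token_features: array<string>
      df.select("token_features").show(false) // the stemmed tokens as plain strings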