Tags: scala, apache-spark, nlp, apache-spark-ml, johnsnowlabs-spark-nlp

How to use JohnSnowLabs NLP Spell correction module NorvigSweetingModel?


I was going through the JohnSnowLabs SpellChecker here.

I found Norvig's algorithm implementation there, and the example section has just the following two lines:

import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
NorvigSweetingModel.pretrained()

How can I apply this pretrained model on my dataframe (df) below to spell-correct the "names" column?

+----------------+---+------------+
|           names|age|       color|
+----------------+---+------------+
|      [abc, cde]| 19|    red, abc|
|[eefg, efa, efb]|192|efg, efz efz|
+----------------+---+------------+

I have tried to do it as follows:

val schk = NorvigSweetingModel.pretrained().setInputCols("names").setOutputCol("Corrected")

val cdf = schk.transform(df)

But the above code gave me the following error:

java.lang.IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in SPELL_a1f11bacb851. Received inputCols: names. Make sure such columns have following annotator types: token
  at scala.Predef$.require(Predef.scala:224)
  at com.johnsnowlabs.nlp.AnnotatorModel.transform(AnnotatorModel.scala:51)
  ... 49 elided

Solution

  • spark-nlp annotators are designed to be used in their own specific pipelines, and the input columns of the different transformers have to include special metadata.

    The exception already tells you that the input to the NorvigSweetingModel should be tokenized:

    Make sure such columns have following annotator types: token

    If I am not mistaken, at minimum you'll have to assemble documents and tokenize here.

    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import org.apache.spark.ml.Pipeline
    
    import spark.implicits._ // pre-imported in spark-shell; needed in a standalone app
    
    val df = Seq(Seq("abc", "cde"), Seq("eefg", "efa", "efb")).toDF("names")
    
    val nlpPipeline = new Pipeline().setStages(Array(
      new DocumentAssembler().setInputCol("names").setOutputCol("document"),
      new Tokenizer().setInputCols("document").setOutputCol("tokens"),
      NorvigSweetingModel.pretrained().setInputCols("tokens").setOutputCol("corrected")
    ))
    

    A Pipeline like this can be applied on your data with a small adjustment: the input data has to be string, not array<string>*:

    import org.apache.spark.sql.functions.concat_ws
    
    val result = df
      .transform(_.withColumn("names", concat_ws(" ", $"names")))
      .transform(df => nlpPipeline.fit(df).transform(df))
    result.show()
    
    +------------+--------------------+--------------------+--------------------+
    |       names|            document|              tokens|           corrected|
    +------------+--------------------+--------------------+--------------------+
    |     abc cde|[[document, 0, 6,...|[[token, 0, 2, ab...|[[token, 0, 2, ab...|
    |eefg efa efb|[[document, 0, 11...|[[token, 0, 3, ee...|[[token, 0, 3, ee...|
    +------------+--------------------+--------------------+--------------------+
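
    If you just want to peek at the corrected words without extending the pipeline, you can select the nested result field of the annotation structs directly. A minimal sketch, assuming the 1.7.x annotation schema (each annotation is a struct with annotatorType, begin, end, result and metadata fields):

    import org.apache.spark.sql.functions.col
    
    // Selecting a nested field over an array<struct> yields an array of
    // that field, so this gives an array<string> of corrected words.
    result.select(col("names"), col("corrected.result").as("corrected_words")).show(false)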
    

    If you want output that can be exported, you should extend your Pipeline with a Finisher.

    import com.johnsnowlabs.nlp.Finisher
    
    new Finisher().setInputCols("corrected").transform(result).show
    
     +------------+------------------+
     |       names|finished_corrected|
     +------------+------------------+
     |     abc cde|        [abc, cde]|
     |eefg efa efb|  [eefg, efa, efb]|
     +------------+------------------+
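
    The Finisher has a few more setters worth knowing about (assuming the 1.7.x API). For example, you can name the output column yourself, emit a delimited string instead of an array, and drop the intermediate annotation columns:

    import com.johnsnowlabs.nlp.Finisher
    
    new Finisher()
      .setInputCols("corrected")
      .setOutputCols("corrected_words") // explicit name instead of finished_corrected
      .setOutputAsArray(false)          // delimited string rather than array<string>
      .setCleanAnnotations(true)        // drop the document/tokens/corrected columns
      .transform(result)
      .show()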
    

    * According to the docs, DocumentAssembler

    can read either a String column or an Array[String]

    but it doesn't look like it works in practice in 1.7.3:

    df.transform(df => nlpPipeline.fit(df).transform(df)).show()
    
    org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(names)' due to data type mismatch: argument 1 requires string type, however, '`names`' is of array<string> type.;;
    'Project [names#62, UDF(names#62) AS document#343]
    +- AnalysisBarrier
          +- Project [value#60 AS names#62]
             +- LocalRelation [value#60]
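
    As a closing note, the concat_ws workaround composes fine with the rest of your original df. A sketch (hypothetical, reusing the nlpPipeline defined above) that keeps the age and color columns and pulls the corrected words back out as an array<string>:

    val full = Seq(
      (Seq("abc", "cde"), 19, "red, abc"),
      (Seq("eefg", "efa", "efb"), 192, "efg, efz efz")
    ).toDF("names", "age", "color")
    
    full
      .withColumn("names", concat_ws(" ", $"names"))
      .transform(df => nlpPipeline.fit(df).transform(df))
      .select($"names", $"corrected.result".as("corrected_names"), $"age", $"color")
      .show(false)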