I was going through the JohnSnowLabs SpellChecker here.
I found the Norvig's algorithm implementation there, and the example section has just the following two lines:
import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
NorvigSweetingModel.pretrained()
How can I apply this pretrained model to my dataframe (df) below to spell-correct the "names" column?
+----------------+---+------------+
|           names|age|       color|
+----------------+---+------------+
|      [abc, cde]| 19|    red, abc|
|[eefg, efa, efb]|192|efg, efz efz|
+----------------+---+------------+
I have tried to do it as follows:
val schk = NorvigSweetingModel.pretrained().setInputCols("names").setOutputCol("Corrected")
val cdf = schk.transform(df)
But the above code gave me the following error:
java.lang.IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in SPELL_a1f11bacb851. Received inputCols: names. Make sure such columns have following annotator types: token
at scala.Predef$.require(Predef.scala:224)
at com.johnsnowlabs.nlp.AnnotatorModel.transform(AnnotatorModel.scala:51)
... 49 elided
spark-nlp annotators are designed to be used in their own specific pipelines, and input columns for different transformers have to include special metadata.
The exception already tells you that the input to the NorvigSweetingModel should be tokenized:
Make sure such columns have following annotator types: token
If I am not mistaken, at minimum you'll have to assemble documents and tokenize them here.
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
import com.johnsnowlabs.nlp.annotators.Tokenizer
import org.apache.spark.ml.Pipeline
val df = Seq(Seq("abc", "cde"), Seq("eefg", "efa", "efb")).toDF("names")
val nlpPipeline = new Pipeline().setStages(Array(
  new DocumentAssembler().setInputCol("names").setOutputCol("document"),
  new Tokenizer().setInputCols("document").setOutputCol("tokens"),
  NorvigSweetingModel.pretrained().setInputCols("tokens").setOutputCol("corrected")
))
A Pipeline like this can be applied to your data with one small adjustment, since the input has to be string, not array<string>*:
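import org.apache.spark.sql.functions.concat_ws
val result = df
  // join the array elements into a single space-separated string
  .transform(_.withColumn("names", concat_ws(" ", $"names")))
  .transform(df => nlpPipeline.fit(df).transform(df))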
result.show()
+------------+--------------------+--------------------+--------------------+
|       names|            document|              tokens|           corrected|
+------------+--------------------+--------------------+--------------------+
|     abc cde|[[document, 0, 6,...|[[token, 0, 2, ab...|[[token, 0, 2, ab...|
|eefg efa efb|[[document, 0, 11...|[[token, 0, 3, ee...|[[token, 0, 3, ee...|
+------------+--------------------+--------------------+--------------------+
If you want an output that can be exported, you should extend your Pipeline with a Finisher.
import com.johnsnowlabs.nlp.Finisher
new Finisher().setInputCols("corrected").transform(result).show
+------------+------------------+
|       names|finished_corrected|
+------------+------------------+
|     abc cde|        [abc, cde]|
|eefg efa efb|  [eefg, efa, efb]|
+------------+------------------+
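The snippet above applies the Finisher to an already transformed DataFrame. If you would rather have it inside the Pipeline itself, something like the following should work. This is only a sketch: it assumes a spark-shell session (so a SparkSession named spark is in scope) and the same df as above, and the names fullPipeline, flat and finished are just placeholders I picked for illustration.

```scala
import com.johnsnowlabs.nlp.{DocumentAssembler, Finisher}
import com.johnsnowlabs.nlp.annotator.NorvigSweetingModel
import com.johnsnowlabs.nlp.annotators.Tokenizer
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.functions.concat_ws
import spark.implicits._ // assumption: `spark` is the active SparkSession, as in spark-shell

// Same stages as before, with the Finisher appended as the last stage
val fullPipeline = new Pipeline().setStages(Array(
  new DocumentAssembler().setInputCol("names").setOutputCol("document"),
  new Tokenizer().setInputCols("document").setOutputCol("tokens"),
  NorvigSweetingModel.pretrained().setInputCols("tokens").setOutputCol("corrected"),
  new Finisher().setInputCols("corrected")
))

// DocumentAssembler needs a string column, so join the array elements first
val flat = df.withColumn("names", concat_ws(" ", $"names"))

val finished = fullPipeline.fit(flat).transform(flat)
finished.show()
```

I would expect this to produce the same finished_corrected column as the standalone call, but I have only run the two-step version shown above.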
* According to the docs, DocumentAssembler can read either a String column or an Array[String], but it doesn't look like it works in practice in 1.7.3:
df.transform(df => nlpPipeline.fit(df).transform(df)).show()
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(names)' due to data type mismatch: argument 1 requires string type, however, '`names`' is of array<string> type.;;
'Project [names#62, UDF(names#62) AS document#343]
+- AnalysisBarrier
+- Project [value#60 AS names#62]
+- LocalRelation [value#60]