I am a beginner with Spark NLP and I am learning it by following the examples from John Snow Labs. I am using Scala on Databricks.
When I follow the example below,
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler().
  setInputCol("text").
  setOutputCol("document")
val regexTokenizer = new Tokenizer().
  setInputCols(Array("sentence")).
  setOutputCol("token")
val sentenceDetector = new SentenceDetector().
  setInputCols(Array("document")).
  setOutputCol("sentence")
val finisher = new Finisher()
  .setInputCols("token")
  .setIncludeMetadata(true)
finisher.withColumn("newCol", explode(arrays_zip($"finished_token", $"finished_ner")))
I am getting the following error when I run the last line:
command-786892578143744:2: error: value withColumn is not a member of com.johnsnowlabs.nlp.Finisher
finisher.withColumn("newCol", explode(arrays_zip($"finished_token", $"finished_ner")))
What may be the reason for this?
When I try the example again after omitting that line, I added the following additional lines of code:
val pipeline = new Pipeline().
  setStages(Array(
    documentAssembler,
    sentenceDetector,
    regexTokenizer,
    finisher
  ))
val data1 = Seq("hello, this is an example sentence").toDF("text")
pipeline.fit(data1).transform(data1).toDF("text")
I got another error when I run the last line:
java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Can anyone help me fix this issue?
Thank you
Here is what your code should look like. First, construct the Pipeline:
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler().
  setInputCol("text").
  setOutputCol("document")
val regexTokenizer = new Tokenizer().
  setInputCols(Array("sentence")).
  setOutputCol("token")
val sentenceDetector = new SentenceDetector().
  setInputCols(Array("document")).
  setOutputCol("sentence")
val finisher = new Finisher()
  .setInputCols("token")
  .setIncludeMetadata(true)
val pipeline = new Pipeline().
  setStages(Array(
    documentAssembler,
    sentenceDetector,
    regexTokenizer,
    finisher
  ))
Create a simple DataFrame for testing:
val data1 = Seq("hello, this is an example sentence").toDF("text")
Now fit the Pipeline and transform your DataFrame with it:
val prediction = pipeline.fit(data1).transform(data1)
The variable prediction is a DataFrame in which you can explode the token column. Let's have a look inside the prediction DataFrame:
scala> prediction.show
+--------------------+--------------------+-----------------------+
| text| finished_token|finished_token_metadata|
+--------------------+--------------------+-----------------------+
|hello, this is an...|[hello, ,, this, ...| [[sentence, 0], [...|
+--------------------+--------------------+-----------------------+
scala> prediction.withColumn("newCol", explode($"finished_token")).show
+--------------------+--------------------+-----------------------+--------+
| text| finished_token|finished_token_metadata| newCol|
+--------------------+--------------------+-----------------------+--------+
|hello, this is an...|[hello, ,, this, ...| [[sentence, 0], [...| hello|
|hello, this is an...|[hello, ,, this, ...| [[sentence, 0], [...| ,|
|hello, this is an...|[hello, ,, this, ...| [[sentence, 0], [...| this|
|hello, this is an...|[hello, ,, this, ...| [[sentence, 0], [...| is|
|hello, this is an...|[hello, ,, this, ...| [[sentence, 0], [...| an|
|hello, this is an...|[hello, ,, this, ...| [[sentence, 0], [...| example|
|hello, this is an...|[hello, ,, this, ...| [[sentence, 0], [...|sentence|
+--------------------+--------------------+-----------------------+--------+
Your first issue, as Alberto mentioned, is that you treated finisher as if it were a DataFrame. It is an annotator; it only yields a DataFrame once the pipeline has been fitted and the data transformed.
The second issue was calling .toDF() where you didn't need it (after the pipeline transformation). The transformed DataFrame has three columns, but .toDF("text") supplies only one name, which is why you get "The number of columns doesn't match".
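If you do want to rename the output columns, toDF needs one name per column. A minimal sketch, assuming the pipeline above (the names tokens and token_metadata are only illustrative):
// prediction has three columns: text, finished_token, finished_token_metadata,
// so toDF must receive exactly three names (illustrative names below):
val renamed = prediction.toDF("text", "tokens", "token_metadata")
renamed.show()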
Leaving aside that your explode call was in the wrong place, you were also zipping a column that doesn't even exist in your pipeline: there is no NER stage, so there is no finished_ner column.
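If your goal was to pair each token with its metadata, you can zip the columns the Finisher actually produces. A sketch assuming Spark 2.4+ (where arrays_zip is available):
import org.apache.spark.sql.functions.{arrays_zip, explode}
// Zip each finished token with its metadata entry, then explode the pairs
// into one row per token.
prediction.withColumn("newCol",
  explode(arrays_zip($"finished_token", $"finished_token_metadata")))
  .show(false)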
Please feel free to ask any question and I'll update the answer accordingly.