scala apache-spark databricks johnsnowlabs-spark-nlp

About an error regarding Spark NLP using Scala


I am a beginner with Spark NLP and am learning it by following the John Snow Labs examples. I am using Scala on Databricks.

When I follow this example:

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler().
    setInputCol("text").
    setOutputCol("document")

val regexTokenizer = new Tokenizer().
    setInputCols(Array("sentence")).
    setOutputCol("token")
val sentenceDetector = new SentenceDetector().
    setInputCols(Array("document")).
    setOutputCol("sentence")

val finisher = new Finisher()
    .setInputCols("token")
    .setIncludeMetadata(true)


finisher.withColumn("newCol", explode(arrays_zip($"finished_token", $"finished_ner")))

I get the following error when I run the last line:

command-786892578143744:2: error: value withColumn is not a member of com.johnsnowlabs.nlp.Finisher
finisher.withColumn("newCol", explode(arrays_zip($"finished_token", $"finished_ner")))

What might be the reason for this?

When I tried the example without this line, I added the following lines of code:

val pipeline = new Pipeline().
    setStages(Array(
        documentAssembler,
        sentenceDetector,
        regexTokenizer,
        finisher
    ))

val data1 = Seq("hello, this is an example sentence").toDF("text")

pipeline.fit(data1).transform(data1).toDF("text")

I got another error when I ran the last line:

java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.

Can anyone help me fix this issue?

Thank you


Solution

  • Here is what your code should look like. First, construct the Pipeline:

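    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    // explode is used further down; spark-shell imports it automatically,
    // a compiled application does not.
    import org.apache.spark.sql.functions.explode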
    
    val documentAssembler = new DocumentAssembler().
        setInputCol("text").
        setOutputCol("document")
    
    val regexTokenizer = new Tokenizer().
        setInputCols(Array("sentence")).
        setOutputCol("token")
    val sentenceDetector = new SentenceDetector().
        setInputCols(Array("document")).
        setOutputCol("sentence")
    
    val finisher = new Finisher()
        .setInputCols("token")
        .setIncludeMetadata(true)
    
    val pipeline = new Pipeline().
        setStages(Array(
            documentAssembler,
            sentenceDetector,
            regexTokenizer,
            finisher
        ))
    

    Create a simple DataFrame for testing:

    val data1 = Seq("hello, this is an example sentence").toDF("text")
    

    Now fit the Pipeline on your DataFrame and use it to transform the data:

    val prediction = pipeline.fit(data1).transform(data1)
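
    The difference between a pipeline stage and the output of transform can be sketched with plain Spark ML. This is a minimal sketch, not the Spark NLP pipeline itself: it assumes Spark ML's built-in Tokenizer as a stand-in for the annotators so that it runs without Spark NLP, and the object and method names are hypothetical.

    ```scala
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.Tokenizer
    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hypothetical demo object; Tokenizer stands in for the Spark NLP stages.
    object StageVersusDataFrame {
      def run(spark: SparkSession): DataFrame = {
        import spark.implicits._

        val data = Seq("hello, this is an example sentence").toDF("text")

        // A stage only describes a transformation. It holds no rows,
        // so it has no DataFrame methods such as withColumn or show.
        val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")

        val pipeline = new Pipeline().setStages(Array(tokenizer))

        // fit returns a PipelineModel; transform returns the DataFrame.
        pipeline.fit(data).transform(data)
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("stage-vs-dataframe")
          .getOrCreate()

        // DataFrame methods are available only on the transform result.
        val result = run(spark)
        println(result.columns.mkString(", "))

        spark.stop()
      }
    }
    ```

    Calling withColumn on tokenizer here would fail to compile in the same way as calling it on finisher, while calling it on the value returned by transform works.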
    
    

    The variable prediction is a DataFrame, so you can explode its token column. Let's look inside the prediction DataFrame:

    scala> prediction.show
    +--------------------+--------------------+-----------------------+
    |                text|      finished_token|finished_token_metadata|
    +--------------------+--------------------+-----------------------+
    |hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|
    +--------------------+--------------------+-----------------------+
    
    scala> prediction.withColumn("newCol", explode($"finished_token")).show
    +--------------------+--------------------+-----------------------+--------+
    |                text|      finished_token|finished_token_metadata|  newCol|
    +--------------------+--------------------+-----------------------+--------+
    |hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|   hello|
    |hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|       ,|
    |hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|    this|
    |hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|      is|
    |hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|      an|
    |hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...| example|
    |hello, this is an...|[hello, ,, this, ...|   [[sentence, 0], [...|sentence|
    +--------------------+--------------------+-----------------------+--------+
    
    
    • Your first issue, as Alberto mentioned, was treating finisher as a DataFrame. It is a pipeline stage, not a DataFrame; withColumn is a DataFrame method, and you only get a DataFrame from the pipeline's transform.

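    • The second issue was calling .toDF("text") after the pipeline transformation, where you don't need it. toDF with column names renames every column, so it expects one name per column; the transformed DataFrame has three columns, which is why you got "The number of columns doesn't match."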

    • Apart from explode being in the wrong place, you are zipping a column that doesn't exist in your pipeline: finished_ner. No stage in this pipeline produces NER output.
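
    The last two points can be illustrated on a small hand-built DataFrame. This is a sketch under stated assumptions: the finished_ner values are invented sample data (the pipeline above has no NER stage), the object and method names are hypothetical, and arrays_zip needs Spark 2.4 or later.

    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.{arrays_zip, col, explode}
    import scala.util.Try

    // Hypothetical demo object, not part of Spark or Spark NLP.
    object FinisherOutputDemo {
      // Stand-in for a finished pipeline output. The finished_ner column is
      // invented sample data; the pipeline in this answer does not produce it.
      def sample(spark: SparkSession): DataFrame = {
        import spark.implicits._
        Seq(
          ("hello John", Seq("hello", "John"), Seq("O", "B-PER"))
        ).toDF("text", "finished_token", "finished_ner")
      }

      // toDF(names) renames all columns at once, so the number of names
      // must equal the number of columns. One name for three columns fails.
      def renameWithOneName(df: DataFrame): Try[DataFrame] =
        Try(df.toDF("text"))

      // Zip the two arrays element by element, then emit one row per pair.
      // This only works because both columns exist in df.
      def zipAndExplode(df: DataFrame): DataFrame =
        df.withColumn(
          "newCol",
          explode(arrays_zip(col("finished_token"), col("finished_ner")))
        )

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("finisher-output-demo")
          .getOrCreate()

        val df = sample(spark)
        println(s"toDF with a single name failed: ${renameWithOneName(df).isFailure}")
        zipAndExplode(df).show(false)

        spark.stop()
      }
    }
    ```

    If you only want some of the columns, select them by name instead of calling toDF.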

    Please feel free to ask any questions and I'll update the answer accordingly.