I'm new to NLP and started with the spark-nlp package for Python. I trained a simple NER model, which I saved and now want to use. However, I keep getting a "wrong or missing inputCols" error, despite the dataframe looking accurate. What am I missing here?
I have tried different approaches using the DocumentAssembler, SentenceDetector and Tokenizer, but none of them seem to work. This is my code:
from pyspark.ml import PipelineModel
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer
from pyspark.ml.feature import Tokenizer
import sparknlp
spark = sparknlp.start()
loaded_model = PipelineModel.load("bert_diseases")
sentences = [
['Hello, this is an example sentence'],
['And this is a second sentence.']
]
data = spark.createDataFrame(sentences).toDF("text")
document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = Tokenizer().setInputCol("text").setOutputCol("token")
data.show()
documents = document.transform(data)
documents.show()
sentences = sentence.transform(documents)
sentences.show()
tokens = token.transform(sentences)
tokens.show()
result = loaded_model.transform(tokens)
result.show()
The first part works as expected; however, I get the "wrong or missing inputCols" error as soon as I try to transform the data with my model. I have also reviewed this question, but unfortunately it did not really help.
Please also see the metadata of the model I use:
{"class":"com.johnsnowlabs.nlp.embeddings.BertEmbeddings","timestamp":1688164205140,"sparkVersion":"3.4.1","uid":"BERT_EMBEDDINGS_e3d4eaf62b32","paramMap":{"outputCol":"embeddings","dimension":768,"caseSensitive":false,"inputCols":["sentence","token"],"storageRef":"small_bert_L2_768"},"defaultParamMap":{"lazyAnnotator":false,"dimension":768,"caseSensitive":false,"engine":"tensorflow","storageRef":"BERT_EMBEDDINGS_e3d4eaf62b32","maxSentenceLength":128,"batchSize":8}}
Thank you in advance for your help!
EDIT:
I figured it might have had something to do with the "token" column being an array (is_nlp_annotator was false for it as well), so I took another approach:
from pyspark.ml import Pipeline, PipelineModel
from sparknlp.training import CoNLL
from sparknlp.base import *
from sparknlp.annotator import *
import sparknlp
spark = sparknlp.start()
loaded_model = PipelineModel.load("bert_diseases")
data = spark.createDataFrame([["I'd like to say we didn't expect that. Jane's boyfriend."]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token").fit(data)
pipeline = Pipeline().setStages([documentAssembler, tokenizer]).fit(data)
tokenized = pipeline.transform(data)
tokenized.selectExpr("token.result").show(truncate=False)
tokenized.show()
inputData = tokenized.drop("text")
inputData.show()
result = loaded_model.transform(inputData)
result.show()
I got the idea from here. However, it still does not work, and I am as confused as ever.
pyspark.errors.exceptions.captured.IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in BERT_EMBEDDINGS_e3d4eaf62b32.
Current inputCols: sentence,token. Dataset's columns:
(column_name=document,is_nlp_annotator=true,type=document)
(column_name=token,is_nlp_annotator=true,type=token).
Make sure such annotators exist in your pipeline, with the right output names and that they have following annotator types: document, token
The dataframe looks correct, though...
+--------------------+--------------------+
| document| token|
+--------------------+--------------------+
|[{document, 0, 55...|[{token, 0, 2, I'...|
+--------------------+--------------------+
I fixed the problem. I had been misled by the error message: I thought the document and token annotations were the ones that were missing and required. The message should rather be read as: you are missing a column named "sentence" that carries a "document"-type annotation, and a column named "token" that carries a "token"-type annotation. In my case only the "sentence" column was missing, so adding a SentenceDetector to the pipeline solved it. This code works perfectly for me:
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer
from pyspark.ml import Pipeline, PipelineModel

spark = sparknlp.start()
loaded_model = PipelineModel.load("bert_diseases")

# string_param holds the text to annotate
data = spark.createDataFrame([[string_param]]).toDF("text")

documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token").fit(data)

pipeline = Pipeline().setStages([documentAssembler, sentence, tokenizer]).fit(data)
tokenized = pipeline.transform(data)

inputData = tokenized.drop("text")
result = loaded_model.transform(inputData)
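To check what the loaded model actually adds, inspecting the result's schema and the embeddings column works. This is only a sketch: the exact output columns depend on the stages saved in bert_diseases (the metadata above only confirms an "embeddings" column; an NER stage would typically add something like "ner"), so adjust the names accordingly:

# Sketch: adjust column names to whatever the saved pipeline actually outputs.
result.printSchema()
result.select("embeddings.result").show(truncate=False)
# If the saved pipeline ends in an NER stage, its output column (often "ner")
# can be inspected the same way, e.g. result.select("ner.result").show()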