I'm new to NLP and started with the spark-nlp package for Python. I trained a simple NER model, which I saved and now want to use. However, I keep getting a "wrong or missing inputCols" error, despite the dataframe looking accurate. What am I missing here?
I have tried different approaches using the DocumentAssembler, SentenceDetector and Tokenizer, but none of them seem to work. This is my code:
from pyspark.ml import PipelineModel
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer
from pyspark.ml.feature import Tokenizer
import sparknlp
spark = sparknlp.start()
loaded_model = PipelineModel.load("bert_diseases")
sentences = [
['Hello, this is an example sentence'],
['And this is a second sentence.']
]
data = spark.createDataFrame(sentences).toDF("text")
document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = Tokenizer().setInputCol("text").setOutputCol("token")
data.show()
documents = document.transform(data)
documents.show()
sentences = sentence.transform(documents)
sentences.show()
tokens = token.transform(sentences)
tokens.show()
result = loaded_model.transform(tokens)
result.show()
The first part works as expected; however, I get the "wrong or missing inputCols" error as soon as I try to transform the data with my model. I have also reviewed this question, but unfortunately it did not really help.
Please also see the metadata of the model I use:
{"class":"com.johnsnowlabs.nlp.embeddings.BertEmbeddings","timestamp":1688164205140,"sparkVersion":"3.4.1","uid":"BERT_EMBEDDINGS_e3d4eaf62b32","paramMap":{"outputCol":"embeddings","dimension":768,"caseSensitive":false,"inputCols":["sentence","token"],"storageRef":"small_bert_L2_768"},"defaultParamMap":{"lazyAnnotator":false,"dimension":768,"caseSensitive":false,"engine":"tensorflow","storageRef":"BERT_EMBEDDINGS_e3d4eaf62b32","maxSentenceLength":128,"batchSize":8}}
Thank you in advance for your help!
EDIT:
I figured it might have had something to do with the "token" column being an array (is_nlp_annotator was false for it as well), so I took another approach:
from pyspark.ml import Pipeline, PipelineModel
from sparknlp.training import CoNLL
from sparknlp.base import *
from sparknlp.annotator import *
import sparknlp
spark = sparknlp.start()
loaded_model = PipelineModel.load("bert_diseases")
data = spark.createDataFrame([["I'd like to say we didn't expect that. Jane's boyfriend."]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token").fit(data)
pipeline = Pipeline().setStages([documentAssembler, tokenizer]).fit(data)
tokenized = pipeline.transform(data)
tokenized.selectExpr("token.result").show(truncate=False)
tokenized.show()
inputData = tokenized.drop("text")
inputData.show()
result = loaded_model.transform(inputData)
result.show()
I got the idea from here. However, it still does not work, and I am as confused as ever.
pyspark.errors.exceptions.captured.IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in BERT_EMBEDDINGS_e3d4eaf62b32.
Current inputCols: sentence,token. Dataset's columns:
(column_name=document,is_nlp_annotator=true,type=document)
(column_name=token,is_nlp_annotator=true,type=token).
Make sure such annotators exist in your pipeline, with the right output names and that they have following annotator types: document, token
The dataframe looks correct, though...
+--------------------+--------------------+
| document| token|
+--------------------+--------------------+
|[{document, 0, 55...|[{token, 0, 2, I'...|
+--------------------+--------------------+
I fixed the problem. I had been misled by the error message: I thought the document and token annotations were the ones that were missing and required. The message should rather be read as: you are missing a column named "sentence" that carries a "document"-type annotation, and a column named "token" that carries a "token"-type annotation. In my case only the "sentence" column was missing, so adding a SentenceDetector to the pipeline solved it. This code works perfectly for me:
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer
from pyspark.ml import Pipeline, PipelineModel

spark = sparknlp.start()
loaded_model = PipelineModel.load("bert_diseases")

# string_param holds the text to annotate
data = spark.createDataFrame([[string_param]]).toDF("text")

documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token").fit(data)

pipeline = Pipeline().setStages([documentAssembler, sentence, tokenizer]).fit(data)
tokenized = pipeline.transform(data)

inputData = tokenized.drop("text")
result = loaded_model.transform(inputData)
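To check what the loaded model actually adds, inspecting the result's schema and the embeddings column works. This is only a sketch: the exact output columns depend on the stages saved in bert_diseases (the metadata above only confirms an "embeddings" column; an NER stage would typically add something like "ner"), so adjust the names accordingly:

# Sketch: adjust column names to whatever the saved pipeline actually outputs.
result.printSchema()
result.select("embeddings.result").show(truncate=False)
# If the saved pipeline ends in an NER stage, its output column (often "ner")
# can be inspected the same way, e.g. result.select("ner.result").show()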