I am trying to set up a simple code where I pass a dataframe and test it with the pretrained explain pipeline provided by johnSnowLabs Spark-NLP library. I am using jupyter notebooks from anaconda and have a spark scala kernet setup using apache toree. Everytime I run the step where it should load the pretrained pipeline, it throws a tensorflow error. Is there a way we can run this on windows locally?
I was trying this in a maven project earlier and the same error had happened. Another colleague tried it on a linux system and it worked. Below is the code I have tried and the error that it gave.
import org.apache.spark.ml.PipelineModel
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession
.builder()
.appName("test")
.master("local[*]")
.config("spark.driver.memory", "4G")
.config("spark.kryoserializer.buffer.max", "200M")
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.getOrCreate()
val testData = spark.createDataFrame(Seq(
(1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"),
(2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States"))).toDF("id", "text")
val pipeline = PretrainedPipeline("explain_document_dl", lang = "en") //this is where it gives error
val annotation = pipeline.transform(testData)
annotation.show()
annotation.select("entities.result").show(false)
Below error occurs:
Name: java.lang.UnsupportedOperationException Message: Spark NLP tried to load a Tensorflow Graph using Contrib module, but failed to load it on this system. If you are on Windows, this operation is not supported. Please try a noncontrib model. If not the case, please report this issue. Original error message:
Op type not registered 'BlockLSTM' in binary running on 'MyMachine'. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.)
tf.contrib.resampler
should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed. StackTrace: Op type not registered 'BlockLSTM' in binary running on 'MyMachine'. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.)tf.contrib.resampler
should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
at com.johnsnowlabs.ml.tensorflow.TensorflowWrapper$.readGraph(TensorflowWrapper.scala:163) at com.johnsnowlabs.ml.tensorflow.TensorflowWrapper$.read(TensorflowWrapper.scala:202) at com.johnsnowlabs.ml.tensorflow.ReadTensorflowModel$class.readTensorflowModel(TensorflowSerializeModel.scala:73) at com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel$.readTensorflowModel(NerDLModel.scala:134) at com.johnsnowlabs.nlp.annotators.ner.dl.ReadsNERGraph$class.readNerGraph(NerDLModel.scala:112) at com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel$.readNerGraph(NerDLModel.scala:134) at com.johnsnowlabs.nlp.annotators.ner.dl.ReadsNERGraph$$anonfun$2.apply(NerDLModel.scala:116) at com.johnsnowlabs.nlp.annotators.ner.dl.ReadsNERGraph$$anonfun$2.apply(NerDLModel.scala:116) at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$$anonfun$com$johnsnowlabs$nlp$ParamsAndFeaturesReadable$$onRead$1.apply(ParamsAndFeaturesReadable.scala:31) at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$$anonfun$com$johnsnowlabs$nlp$ParamsAndFeaturesReadable$$onRead$1.apply(ParamsAndFeaturesReadable.scala:30) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$class.com$johnsnowlabs$nlp$ParamsAndFeaturesReadable$$onRead(ParamsAndFeaturesReadable.scala:30) at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$$anonfun$read$1.apply(ParamsAndFeaturesReadable.scala:41) at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$$anonfun$read$1.apply(ParamsAndFeaturesReadable.scala:41) at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:19) at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:8) at org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstance(ReadWrite.scala:652) at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:274) at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:272) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:272) at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348) at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342) at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadPipeline(ResourceDownloader.scala:135) at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadPipeline(ResourceDownloader.scala:129) at com.johnsnowlabs.nlp.pretrained.PretrainedPipelinenter code here
e.(PretrainedPipeline.scala:14)
I checked, there is an NER model inside that pipeline. That NER model was trained by using TensorFlow and it has some contrib
code inside which is only compatible with Unix-based OS such as Linux and macOS. There is an open issue here:
https://github.com/tensorflow/tensorflow/issues/26468
For that purpose, they have released some Windows-compatible pipelines which are named noncontrib
. You can change the name of the pipeline to the following:
val pipeline = PretrainedPipeline("explain_document_dl_noncontrib", lang = "en")
The source for all pre-trained pipelines: https://nlp.johnsnowlabs.com/docs/en/pipelines
Full disclosure: I am one of the contributors to the Spark NLP library.
UPDATE: Since the release of Spark NLP 2.4.0
, all the models and pipelines are now cross-platform: https://github.com/JohnSnowLabs/spark-nlp-models
This should work on Linux, macOS and Windows if you are using Spark NLP 2.4.0 release:
val pipeline = PretrainedPipeline("explain_document_dl", lang = "en")
UPDATE 2022: With the exception of M1 and aarch64 architectures (for now), all the 5000+ models/pipelines are compatible with Windows (8, 10, and 11), Linux (Ubuntu, Debian, CentOS, and etc.), and macOS operating systems. Spark NLP Models Hub: https://nlp.johnsnowlabs.com/models