Do Spark NLP pretrained pipelines only work on Linux systems?

I am trying to set up a simple program that passes a DataFrame to the pretrained explain pipeline provided by the John Snow Labs Spark NLP library. I am using Jupyter notebooks from Anaconda with a Spark Scala kernel set up through Apache Toree. Every time I run the step that loads the pretrained pipeline, it throws a TensorFlow error. Is there a way to run this locally on Windows?

I tried this in a Maven project earlier and got the same error. A colleague tried it on a Linux system and it worked. Below are the code I tried and the error it produced.

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
import org.apache.spark.sql.SparkSession

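// appName and master are assumed values for a local run
val spark: SparkSession = SparkSession
    .builder()
    .appName("test")
    .master("local[*]")
    .config("spark.driver.memory", "4G")
    .config("spark.kryoserializer.buffer.max", "200M")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()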

val testData = spark.createDataFrame(Seq(
    (1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"),
    (2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States"))).toDF("id", "text")
val pipeline = PretrainedPipeline("explain_document_dl", lang = "en") // this line throws the error
val annotation = pipeline.transform(testData)
annotation.select("entities.result").show(false)

The following error occurs:

Name: java.lang.UnsupportedOperationException Message: Spark NLP tried to load a Tensorflow Graph using Contrib module, but failed to load it on this system. If you are on Windows, this operation is not supported. Please try a noncontrib model. If not the case, please report this issue. Original error message:

Op type not registered 'BlockLSTM' in binary running on 'MyMachine'. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) tf.contrib.resampler should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed. StackTrace: Op type not registered 'BlockLSTM' in binary running on 'MyMachine'. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) tf.contrib.resampler should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
at$.readGraph(TensorflowWrapper.scala:163)
at$.read(TensorflowWrapper.scala:202)
at$class.readTensorflowModel(TensorflowSerializeModel.scala:73)
at com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel$.readTensorflowModel(NerDLModel.scala:134)
at com.johnsnowlabs.nlp.annotators.ner.dl.ReadsNERGraph$class.readNerGraph(NerDLModel.scala:112)
at com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel$.readNerGraph(NerDLModel.scala:134)
at com.johnsnowlabs.nlp.annotators.ner.dl.ReadsNERGraph$$anonfun$2.apply(NerDLModel.scala:116)
at com.johnsnowlabs.nlp.annotators.ner.dl.ReadsNERGraph$$anonfun$2.apply(NerDLModel.scala:116)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$$anonfun$com$johnsnowlabs$nlp$ParamsAndFeaturesReadable$$onRead$1.apply(ParamsAndFeaturesReadable.scala:31)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$$anonfun$com$johnsnowlabs$nlp$ParamsAndFeaturesReadable$$onRead$1.apply(ParamsAndFeaturesReadable.scala:30)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$$johnsnowlabs$nlp$ParamsAndFeaturesReadable$$onRead(ParamsAndFeaturesReadable.scala:30)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$$anonfun$read$1.apply(ParamsAndFeaturesReadable.scala:41)
at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable$$anonfun$read$1.apply(ParamsAndFeaturesReadable.scala:41)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:19)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:8)
at$.loadParamsInstance(ReadWrite.scala:652)
at$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:274)
at$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:272)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$
at scala.collection.mutable.ArrayOps$
at$SharedReadWrite$.load(Pipeline.scala:272)
at$PipelineModelReader.load(Pipeline.scala:348)
at$PipelineModelReader.load(Pipeline.scala:342)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadPipeline(ResourceDownloader.scala:135)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadPipeline(ResourceDownloader.scala:129)
at com.johnsnowlabs.nlp.pretrained.PretrainedPipeline.(PretrainedPipeline.scala:14)


  • I checked: there is an NER model inside that pipeline. It was trained with TensorFlow and contains some contrib code that is only compatible with Unix-like operating systems such as Linux and macOS. There is an open issue here:

    To work around this, they have released some Windows-compatible pipelines with noncontrib in the name. You can change the name of the pipeline to the following:

    val pipeline = PretrainedPipeline("explain_document_dl_noncontrib", lang = "en")

    The source for all pre-trained pipelines:

    Full disclosure: I am one of the contributors to the Spark NLP library.

    UPDATE: Since the release of Spark NLP 2.4.0, all the models and pipelines are now cross-platform:

    This should work on Linux, macOS and Windows if you are using Spark NLP 2.4.0 release:

    val pipeline = PretrainedPipeline("explain_document_dl", lang = "en")

    UPDATE 2022: With the exception of M1 and aarch64 architectures (for now), all the 5000+ models/pipelines are compatible with Windows (8, 10, and 11), Linux (Ubuntu, Debian, CentOS, etc.), and macOS. Spark NLP Models Hub:
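
    If the same notebook has to run on several machines, the naming rule described above could be captured in a small helper. This is only a sketch: PipelineName, isWindows, majorMinor and explainPipelineName are hypothetical names, not part of Spark NLP, and the version string is assumed to be supplied by the caller in the usual "major.minor.patch" form. It does not handle the M1/aarch64 exception mentioned in the 2022 update.

```scala
// Hypothetical helper (not part of Spark NLP): picks the pipeline name
// based on the operating system and the Spark NLP version in use.
object PipelineName {

  // True when the JVM's os.name value looks like a Windows system.
  def isWindows(osName: String): Boolean =
    osName.toLowerCase.startsWith("windows")

  // Extracts (major, minor) from a version string such as "2.3.6".
  // Missing or non-numeric parts are treated as 0.
  def majorMinor(version: String): (Int, Int) = {
    val nums = version
      .split("\\.")
      .toList
      .map(_.takeWhile(_.isDigit))
      .filter(_.nonEmpty)
      .map(_.toInt)
    (nums.headOption.getOrElse(0), nums.drop(1).headOption.getOrElse(0))
  }

  // Before 2.4.0, Windows needs the noncontrib variant; from 2.4.0 on,
  // the regular pipeline is cross-platform according to the answer above.
  def explainPipelineName(osName: String, sparkNlpVersion: String): String = {
    val (major, minor) = majorMinor(sparkNlpVersion)
    val crossPlatform = major > 2 || (major == 2 && minor >= 4)
    if (isWindows(osName) && !crossPlatform) "explain_document_dl_noncontrib"
    else "explain_document_dl"
  }
}
```

    The result can then be passed to PretrainedPipeline in place of the hard-coded name, for example with System.getProperty("os.name") as the first argument and the Spark NLP version from your build file as the second.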