I am trying to build a Dataproc cluster with Spark NLP installed in it, and then quickly test it by reading some CoNLL 2003 data. First, I used this codelab as inspiration to build my own, smaller cluster (the project name has been redacted for privacy):
gcloud dataproc clusters create s17-sparknlp-experiments \
--enable-component-gateway \
--region us-west1 \
--metadata 'PIP_PACKAGES=google-cloud-storage spark-nlp==2.5.5' \
--zone us-west1-a \
--single-node \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 35 \
--image-version 1.5-debian10 \
--initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh \
--optional-components JUPYTER,ANACONDA \
--project my-project
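Before going any further, it may be worth confirming that the pip initialization action actually installed the requested packages; a minimal sketch, runnable in any Python shell or notebook cell on the cluster:
import pkg_resources
# Confirm the PIP_PACKAGES metadata was honored by the pip-install.sh init action
for pkg in ('spark-nlp', 'google-cloud-storage'):
    print(pkg, pkg_resources.get_distribution(pkg).version)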
I started JupyterLab on the cluster just created, then downloaded these CoNLL 2003 files into an original directory under the filesystem root (i.e., /original). If done correctly, when you run these commands:
cd / && head -n 5 original/eng.train
The following result should be obtained:
-DOCSTART- -X- -X- O
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
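For completeness, the download step can also be scripted from a notebook cell; a minimal sketch, where the URL is only a placeholder for the actual location of the files linked above:
import os
import urllib.request
os.makedirs('/original', exist_ok=True)
# Placeholder URL: substitute the real location of the CoNLL 2003 files
url = 'https://example.com/conll2003/eng.train'
urllib.request.urlretrieve(url, '/original/eng.train')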
These files should therefore be readable by the following Python code, contained in a single-cell Jupyter Notebook:
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.training import CoNLL
import sparknlp
spark = sparknlp.start()
print("Spark NLP version: ", sparknlp.version()) # 2.4.4
print("Apache Spark version: ", spark.version) # 2.4.8
# Other info of possible interest:
# Python 3.6.13 :: Anaconda, Inc.
# openjdk version "1.8.0_312"
# OpenJDK Runtime Environment (Temurin)(build 1.8.0_312-b07)
# OpenJDK 64-Bit Server VM (Temurin)(build 25.312-b07, mixed mode)
training_data = CoNLL().readDataset(spark, 'original/eng.train') # The exact same path used before...
training_data.show()
Instead, the following error is raised:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-4-2b145ab3b733> in <module>
----> 1 training_data = CoNLL().readDataset(spark, 'original/eng.train')
2 training_data.show()
/opt/conda/anaconda/lib/python3.6/site-packages/sparknlp/training.py in readDataset(self, spark, path, read_as)
32 jSession = spark._jsparkSession
33
---> 34 jdf = self._java_obj.readDataset(jSession, path, read_as)
35 return DataFrame(jdf, spark._wrapped)
36
/opt/conda/anaconda/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/opt/conda/anaconda/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o87.readDataset.
: java.io.FileNotFoundException: file or folder: original/eng.train not found
at com.johnsnowlabs.nlp.util.io.ResourceHelper$SourceStream.<init>(ResourceHelper.scala:44)
at com.johnsnowlabs.nlp.util.io.ResourceHelper$.parseLines(ResourceHelper.scala:215)
at com.johnsnowlabs.nlp.training.CoNLL.readDocs(CoNLL.scala:31)
at com.johnsnowlabs.nlp.training.CoNLL.readDataset(CoNLL.scala:198)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
QUESTION: What could possibly be going wrong here?
ANSWER: @jccampanero pointed me in the right direction, though some tweaks were needed. The root cause is that Spark resolves a bare path like original/eng.train against its default filesystem, which on Dataproc is HDFS, not the local disk the files were downloaded to. The fix is to store the files you want to import in a Google Cloud Storage bucket, and then use that file's URI in readDataset:
training_data = CoNLL().readDataset(spark, 'gs://my-bucket/subfolders/eng.train')
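Since google-cloud-storage was already installed through the PIP_PACKAGES metadata above, the upload itself can be done straight from the notebook; a minimal sketch, where the bucket and object names are placeholders and the cluster's service account is assumed to have write access to the bucket:
from google.cloud import storage
# Copy the driver-local file to GCS, where Spark can resolve it by URI
client = storage.Client()                     # uses the cluster's own credentials
bucket = client.bucket('my-bucket')           # placeholder bucket name
blob = bucket.blob('subfolders/eng.train')    # placeholder object path
blob.upload_from_filename('/original/eng.train')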
This is not the only valid way to achieve what I am looking for; there are others. One alternative is sketched below.
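For instance, since this is a single-node cluster and the stack trace shows CoNLL().readDataset reading the file on the driver, an explicit file:// URI pointing at the local copy may also work; an untested sketch:
# Untested alternative: an explicit file:// scheme bypasses the default
# (HDFS) filesystem and reads from the driver's local disk instead
training_data = CoNLL().readDataset(spark, 'file:///original/eng.train')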