Tags: google-cloud-platform, pyspark, textblob

PySpark on GCP: ModuleNotFoundError: No module named 'textblob'


I am using a UDF in PySpark in a Jupyter notebook on GCP. I want to use TextBlob to run sentiment analysis on text. I have already imported textblob in the notebook, and I ran the following command in my virtual machine's terminal:

pip3 install -U textblob

When I run the following code:

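from pyspark.sql.functions import udf
from textblob import TextBlob
# sentiment[0] is the polarity score of the text
sentiment = udf(lambda x: TextBlob(x).sentiment[0])
spark.udf.register("sentiment", sentiment)
df = df.withColumn('sentiment', sentiment('text').cast('double'))
df.show(1)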

I still get the following error:

PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 588, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 249, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
    command = serializer._read_with_length(file)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
    return self.loads(obj)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'textblob'

I am new to GCP and cloud computing, and I don't know what is causing the problem. Is it because I didn't install the package into the right path?


Solution

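  • I think this is more of a Jupyter notebook issue than a GCP one. Jupyter provides the %pip and %conda magics, which install a Python module into the same Python environment that the notebook kernel runs on (for example, %pip install -U textblob in a notebook cell). Running pip3 in a terminal may install into a different Python environment. Note that the traceback comes from the Python worker, so the module also has to be importable by the interpreter that the Spark workers use; on a cluster with several nodes, installing it on one machine may not be enough.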
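
To check whether the package landed in the right environment, it may help to look at which interpreter is actually running and whether it can find the module. The following is only a minimal diagnostic sketch; the helper name module_status is hypothetical and is not part of PySpark or TextBlob.

```python
def module_status(name):
    """Report the running interpreter and whether it can import `name`."""
    # Imports live inside the function so it stays self-contained
    # if it is later shipped to another Python process.
    import importlib.util
    import sys

    spec = importlib.util.find_spec(name)  # None if the module is not importable
    return {
        "executable": sys.executable,              # path of this Python interpreter
        "module": name,
        "found": spec is not None,
        "location": getattr(spec, "origin", None), # file the module would load from
    }


# In a notebook cell this describes the driver (the notebook kernel).
print(module_status("textblob"))
```

If "found" is False here, or "executable" differs from the Python that pip3 used in the terminal, the package was installed into a different environment. Assuming a live SparkSession named spark as in the question, the same helper could be run on the workers with spark.sparkContext.parallelize(range(4), 4).map(lambda _: module_status("textblob")).collect() to compare their interpreters with the driver's.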