Tags: python, pyspark, nltk, apache-spark-ml, lemmatization

# string methods TypeError: Column is not iterable in pyspark


I'm trying to re-implement a sentiment analysis pipeline, originally written in Python, in PySpark because I'm working with big data. I'm new to PySpark syntax, and I'm getting an error while trying to apply a lemmatization function from the NLTK package.

Error: `TypeError: Column is not iterable`. Below are the code and data.

+-------+--------------------+--------------------+--------------------+--------------------+
|overall|       reviewsummary|     cleanreviewText|         reviewText1|  filteredreviewText|
+-------+--------------------+--------------------+--------------------+--------------------+
|    5.0|exactly what i ne...|exactly what i ne...|[exactly, what, i...|[exactly, needed,...|
|    2.0|i agree with the ...|i agree with the ...|[i, agree, with, ...|[agree, review, o...|
|    4.0|love these... i a...|love these... i a...|[love, these, i, ...|[love, going, ord...|
|    2.0|too tiny an openi...|too tiny an openi...|[too, tiny, an, o...|[tiny, opening, t...|
|    3.0|    okay three stars|    okay three stars|[okay, three, stars]|[okay, three, stars]|
|    5.0|exactly what i wa...|exactly what i wa...|[exactly, what, i...|[exactly, wanted,...|
|    4.0|these little plas...|these little plas...|[these, little, p...|[little, plastic,...|
|    3.0|mother - in - law...|mother - in - law...|[mother, in, law,...|[mother, law, wan...|
|    3.0|item is of good q...|item is of good q...|[item, is, of, go...|[item, good, qual...|
|    3.0|i had used my las...|i had used my las...|[i, had, used, my...|[used, last, el, ...|
+-------+--------------------+--------------------+--------------------+--------------------+
only showing top 10 rows


In [18]: dfStopwordRemoved.printSchema()
root
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- reviewsummary: string (nullable = true)
 |-- cleanreviewText: string (nullable = true)
 |-- reviewText1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- filteredreviewText: array (nullable = true)
 |    |-- element: string (containsNull = true)

Lemmatization functions:

from collections import Counter
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_part_of_speech(word):
    probable_part_of_speech = wordnet.synsets(word)

    pos_counts = Counter()
    pos_counts["n"] = len([item for item in probable_part_of_speech if item.pos() == "n"])
    pos_counts["v"] = len([item for item in probable_part_of_speech if item.pos() == "v"])
    pos_counts["a"] = len([item for item in probable_part_of_speech if item.pos() == "a"])
    pos_counts["r"] = len([item for item in probable_part_of_speech if item.pos() == "r"])

    most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
    return most_likely_part_of_speech

def Lemmatizing_Words(Words):
    Lm = WordNetLemmatizer()
    Lemmatized_Words = []
    for word in Words:
        Lemmatized_Words.append(Lm.lemmatize(word, get_part_of_speech(word)))
    return Lemmatized_Words

Function call (this is where the error is raised):

x2 = list()
for word in dfStopwordRemoved.select('filteredreviewText'):
    x_temp = Lemmatizing_Words(word)
    x2.append(x_temp)

Please refer to the link above for the full error traceback.


Solution

  • The error message is accurate: you cannot iterate through a DataFrame column like a standard Python iterator. To apply a built-in function such as sum, mean, etc., you use the withColumn() or select() functions. In your case you have your own custom function, so you need to register it as a UDF and use it with withColumn() or select().

    Below is the example of a udf from spark documents - https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html?highlight=udf#pyspark.sql.functions.udf

    >>> from pyspark.sql.types import IntegerType
    >>> slen = udf(lambda s: len(s), IntegerType())
    >>> @udf
    ... def to_upper(s):
    ...     if s is not None:
    ...         return s.upper()
    ...
    >>> @udf(returnType=IntegerType())
    ... def add_one(x):
    ...     if x is not None:
    ...         return x + 1
    ...
    >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age"))
    >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")).show()
    +----------+--------------+------------+
    |slen(name)|to_upper(name)|add_one(age)|
    +----------+--------------+------------+
    |         8|      JOHN DOE|          22|
    +----------+--------------+------------+