Working with Spark in Databricks, I want to find the longest word in the RDD wordRDD.
I have created this function in Python:
def compare_strings_len(x, y):
    if len(x) > len(y):
        print(x)
    elif len(x) < len(y):
        print('String 2 is longer: ', y)
    else:
        print(min(x, y))
and I want to use this function inside a reduce with the code below:
word_length = (
    wordRDD
    .map(lambda x: (x, 1))
    .reduceByKey(lambda x, y: compare_strings_len)
)
print(word_length)
The result I get is:
PythonRDD[151] at RDD at PythonRDD.scala:58
What am I doing wrong?
Thanks
There are a few problems here:

1. Your function prints instead of returning a value, so the reduction has nothing to work with. It should return the longer string:

   def compare_strings_len(x, y):
       if len(x) > len(y):
           return x
       return y

2. The reduce function should be used instead of reduceByKey — you are comparing whole strings, not aggregating values per key. Also note that lambda x, y: compare_strings_len returns the function object itself without ever calling it; pass compare_strings_len directly.

3. reduce is an action, so it returns a value rather than an RDD. If you keep the .map(lambda x: (x, 1)) step, that value is a (word, count) tuple, and you get the word itself with word_length[0].
P.S. Frankly speaking, I would use Spark SQL instead of that.