Tags: python, function, apache-spark, rdd, databricks

Finding the word with maximal length from an RDD in Spark


Working in Spark on Databricks, I want to find the longest word in the RDD wordRDD.

I have created a function in Python:

def compare_strings_len(x, y):
    if len(x) > len(y):
        print(x)
    elif len(x) < len(y):
        print('String 2 is longer: ', y)
    else:
        print(min(x, y))

and I want to insert this function inside reduce with the code below:

word_length = (
    wordRDD
    .map(lambda x: (x, 1))
    .reduceByKey(lambda x, y: compare_strings_len)
)
print(word_length)

The result I get is:

PythonRDD[151] at RDD at PythonRDD.scala:58

What am I doing wrong?

Thx


Solution

  • There are a few problems here:

    • The reduce function needs to return the actual value, not just print the result. It should be something like this:

    def compare_strings_len(x, y):
        if len(x) > len(y):
            return x
        return y

    • Instead of reduceByKey, use reduce, and apply it to wordRDD directly: reduceByKey reduces values per key, while you want a single result for the whole RDD. The .map(lambda x: (x, 1)) step should also be dropped, since it wraps every word in a (word, 1) tuple before the comparison.

    • reduce is an action: it returns a plain Python value to the driver, not another RDD, so the printed word_length is already the final result and no further extraction is needed.
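    Putting the fixes together, the comparator can be sanity-checked locally before running it on a cluster: Python's built-in functools.reduce has the same (x, y) -> value contract as RDD.reduce, so the same function works in both. The word list below is hypothetical sample data standing in for wordRDD.

    ```python
    from functools import reduce

    def compare_strings_len(x, y):
        # Return the longer string; on a tie, keep the alphabetically
        # smaller one, mirroring the min(x, y) branch in the question.
        if len(x) > len(y):
            return x
        if len(x) < len(y):
            return y
        return min(x, y)

    words = ["spark", "databricks", "rdd"]  # stand-in for wordRDD's contents
    longest = reduce(compare_strings_len, words)
    print(longest)  # -> databricks
    ```

    On Databricks the equivalent call would be wordRDD.reduce(compare_strings_len).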

    P.S. Frankly speaking, I would use Spark SQL instead of that.
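    One possible shape of that Spark SQL approach (a sketch, not the answerer's exact code): put the words in a single-column DataFrame and order by string length. The sample words and the column name "word" are assumptions for illustration; length and col are standard pyspark.sql.functions.

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, length

    spark = SparkSession.builder.master("local[1]").appName("longest-word").getOrCreate()

    # Hypothetical sample data standing in for the words in wordRDD.
    words_df = spark.createDataFrame(
        [(w,) for w in ["spark", "databricks", "rdd"]], ["word"]
    )

    # Sort by descending string length and take the top row.
    longest = words_df.orderBy(length(col("word")).desc()).first()["word"]
    print(longest)  # -> databricks

    spark.stop()
    ```

    On Databricks the SparkSession already exists as spark, so the builder and stop() lines would be omitted.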