Tags: python, function, apache-spark, rdd, databricks

Finding the word with maximal length from an RDD in Spark


Working in Spark on Databricks, I want to find the longest word in the RDD wordRDD.

I have created a function in Python:

def compare_strings_len(x, y):
    if len(x) > len(y):
        print(x)
    elif len(x) < len(y):
        print('String 2 is longer: ', y)
    else:
        print(min(x, y))

and I want to insert this function inside reduce with the code below:

word_length = (
    wordRDD
    .map(lambda x: (x, 1))
    .reduceByKey(lambda x, y: compare_strings_len)
)
print(word_length)

The result I get is:

PythonRDD[151] at RDD at PythonRDD.scala:58

What am I doing wrong?

Thx


Solution

  • There are a few problems here:

    • The reduce function needs to return the actual value, not just print the result. It should be something like this:

    def compare_strings_len(x, y):
        if len(x) > len(y):
            return x
        return y

    • Instead of reduceByKey, use reduce, and apply it to wordRDD directly: reduceByKey reduces values per key, while you want a single result for the whole RDD. The .map(lambda x: (x, 1)) step should also be dropped, since it wraps every word in a (word, 1) tuple before the comparison.

    • reduce is an action: it returns a plain Python value to the driver, not another RDD, so the printed word_length is already the final result and no further extraction is needed.
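    Putting the fixes together, the comparator can be sanity-checked locally before running it on a cluster: Python's built-in functools.reduce has the same (x, y) -> value contract as RDD.reduce, so the same function works in both. The word list below is hypothetical sample data standing in for wordRDD.

    ```python
    from functools import reduce

    def compare_strings_len(x, y):
        # Return the longer string; on a tie, keep the alphabetically
        # smaller one, mirroring the min(x, y) branch in the question.
        if len(x) > len(y):
            return x
        if len(x) < len(y):
            return y
        return min(x, y)

    words = ["spark", "databricks", "rdd"]  # stand-in for wordRDD's contents
    longest = reduce(compare_strings_len, words)
    print(longest)  # -> databricks
    ```

    On Databricks the equivalent call would be wordRDD.reduce(compare_strings_len).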

    P.S. Frankly speaking, I would use Spark SQL instead of that.
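    One possible shape of that Spark SQL approach (a sketch, not the answerer's exact code): put the words in a single-column DataFrame and order by string length. The sample words and the column name "word" are assumptions for illustration; length and col are standard pyspark.sql.functions.

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, length

    spark = SparkSession.builder.master("local[1]").appName("longest-word").getOrCreate()

    # Hypothetical sample data standing in for the words in wordRDD.
    words_df = spark.createDataFrame(
        [(w,) for w in ["spark", "databricks", "rdd"]], ["word"]
    )

    # Sort by descending string length and take the top row.
    longest = words_df.orderBy(length(col("word")).desc()).first()["word"]
    print(longest)  # -> databricks

    spark.stop()
    ```

    On Databricks the SparkSession already exists as spark, so the builder and stop() lines would be omitted.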