I would like to know why I am getting a TypeError when trying to calculate the total number of characters across all values for each name (key) in the data below, using the reduceByKey function.
data = [("Cassavetes, Frank", 'Orange'),
("Cassavetes, Frank", 'Pineapple'),
("Knight, Shirley (I)", 'Apple'),
("Knight, Shirley (I)", 'Blueberries'),
("Knight, Shirley (I)", 'Orange'),
("Yip, Françoise", 'Grapes'),
("Yip, Françoise", 'Apple'),
("Yip, Françoise", 'Strawberries'),
("Danner, Blythe", 'Pear'),
("Buck (X)", 'Kiwi')]
In an attempt to do this I tried to execute the code below:
rdd = spark.sparkContext.parallelize(data)
reducedRdd = rdd.reduceByKey( lambda a,b: len(a) + len(b) )
reducedRdd.collect()
The code above gives me the following error:
TypeError: object of type 'int' has no len()
The output I expected is as follows:
[('Yip, Françoise', 14), ('Cassavetes, Frank', 15), ('Knight, Shirley (I)', 8), ('Danner, Blythe', 'Pear'), ('Buck (X)', 'Kiwi')]
I have noticed that the code below produces the desired results:
reducedRdd = rdd.reduceByKey( lambda a,b: len(str(a)) + len(str(b)) )
Though I am not sure why I would need to convert the variables a and b into strings if they are strings to begin with. For example, I am not sure how the 'Orange' in ("Cassavetes, Frank", 'Orange') can be considered an int.
PS: I know I can use a number of other functions to achieve the desired results, but I specifically want to know why I am having issues doing this with the reduceByKey function.
The problem in your code is that the reduce function you pass to reduceByKey doesn't produce the same data type as the RDD values: the lambda returns an int, while your values are strings.
To understand this, consider how the reduce works. The function is applied to the first two values, then it is applied to that result and the third value, and so on. After the first step the running result is already an int, so the next call tries to take len() of an int and fails.
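You can replay this outside Spark with Python's functools.reduce, which follows the same left-to-right folding pattern (this is just an illustrative sketch of the per-key reduction, not what Spark executes internally, and the grouping order can differ across partitions):

from functools import reduce

# Values for the key 'Knight, Shirley (I)'
values = ['Apple', 'Blueberries', 'Orange']

# The same function passed to reduceByKey
f = lambda a, b: len(a) + len(b)

# Step 1: f('Apple', 'Blueberries') -> 5 + 11 = 16 (an int)
# Step 2: f(16, 'Orange')           -> len(16) raises the error
reduce(f, values)
# TypeError: object of type 'int' has no len()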
Note that even the version that worked for you isn't actually correct. For example, it returns ('Danner, Blythe', 'Pear') instead of ('Danner, Blythe', 4), because keys with a single value never go through the reduce function at all.
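The same local replay shows why the str() workaround also distorts the sums for keys with three or more values, which is why you saw 14 for 'Yip, Françoise' instead of 23 (again, only a sketch of the folding order):

from functools import reduce

values = ['Grapes', 'Apple', 'Strawberries']   # 'Yip, Françoise'

f = lambda a, b: len(str(a)) + len(str(b))

# Step 1: len('Grapes') + len('Apple')        -> 6 + 5  = 11
# Step 2: len(str(11)) + len('Strawberries')  -> 2 + 12 = 14  (expected 23)
print(reduce(f, values))  # 14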
You should first transform the values into their corresponding lengths, then reduce by key:
reducedRdd = rdd.mapValues(lambda x: len(x)).reduceByKey(lambda a, b: a + b)
print(reducedRdd.collect())
# [('Cassavetes, Frank', 15), ('Danner, Blythe', 4), ('Buck (X)', 4), ('Knight, Shirley (I)', 22), ('Yip, Françoise', 23)]
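If you'd rather keep it in a single *ByKey call, aggregateByKey lets the accumulator have a different type (int) from the values (str). This is just an alternative sketch, not something the mapValues approach requires:

# zeroValue 0 starts each key's total; seqFunc adds a value's length to the
# running int total; combFunc merges int totals from different partitions.
reducedRdd = rdd.aggregateByKey(0,
                                lambda acc, v: acc + len(v),
                                lambda a, b: a + b)
print(reducedRdd.collect())  # same totals as the mapValues version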