apache-spark, pyspark, rdd

Join RDD and get min value


I have multiple RDDs and want to find the common words by joining them and taking the minimum count. So I join and compute it with the code below:

from pyspark import SparkContext
sc = SparkContext("local", "Join app")
x = sc.parallelize([("spark", 1), ("hadoop", 4)])
y = sc.parallelize([("spark", 2), ("hadoop", 5)])
joined = x.join(y).map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda (x,y,z) : (x,y)  if y<=z else (x,z))
final = joined.collect()
print "Join RDD -> %s" % (final)

But this throws the error below:

TypeError: int() argument must be a string or a number, not 'tuple'

So I am passing in a tuple instead of a number, but I'm not sure what is causing it. Any help is appreciated.


Solution

  • x.join(other, numPartitions=None): Return an RDD containing all pairs of elements with matching keys in self and other. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.

    Therefore the second element of each joined record is a tuple:

    In [2]: x.join(y).collect()
    Out[2]: [('spark', (1, 2)), ('hadoop', (4, 5))]
    

    Solution:

    x = sc.parallelize([("spark", 1), ("hadoop", 4)])
    y = sc.parallelize([("spark", 2), ("hadoop", 5)])
    joined = x.join(y)
    # x[1] is the (v1, v2) tuple produced by join(); min() picks the smaller count
    final = joined.map(lambda x: (x[0], min(x[1])))
    final.collect()
    # [('spark', 1), ('hadoop', 4)]
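
    Since the question mentions multiple RDDs, here is a minimal sketch of one way to extend this beyond two: fold join() across a list of RDDs with functools.reduce, taking the running minimum with mapValues(). The third RDD and its values are hypothetical, added only for illustration.

    from functools import reduce
    from pyspark import SparkContext

    sc = SparkContext("local", "Join app")

    # Any number of (word, count) RDDs; the third one is made up for this example
    rdds = [
        sc.parallelize([("spark", 1), ("hadoop", 4)]),
        sc.parallelize([("spark", 2), ("hadoop", 5)]),
        sc.parallelize([("spark", 7), ("hadoop", 3)]),
    ]

    # join() keeps only keys present on both sides, so folding it across the
    # list leaves exactly the words common to every RDD. mapValues() leaves
    # the key alone and reduces each (v1, v2) tuple to the smaller count.
    common_min = reduce(
        lambda acc, nxt: acc.join(nxt).mapValues(lambda v: min(v[0], v[1])),
        rdds,
    )

    common_min.collect()
    # [('spark', 1), ('hadoop', 3)]  (order may vary)

    mapValues() is used here instead of map() because the key is unchanged; it also preserves the RDD's partitioner, which map() does not.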