Tags: python, apache-spark, pyspark, reduce

How does reduceByKey() in PySpark know which column is the key and which one is the value?


I'm a newbie to PySpark and am going through this.

How does reduceByKey() know whether it should consider the first column as the key and the second as the value, or vice versa?

I don't see any column name or index mentioned in the call to reduceByKey(). Does reduceByKey() by default consider the first column as the key and the second as the value?

How do I perform reduceByKey() if there are multiple columns in the DataFrame?

I'm aware of df.select(col1, col2).reduceByKey(). I'm just wondering if there is any other way.


Solution

  • I'm not sure which version of Spark you are on, but it should not matter much. I'll assume you are on version 3.4.1, which is the latest version as of this writing.

    If we take a look at the function signature of reduceByKey in the source code, we see this:

        def reduceByKey(
            self: "RDD[Tuple[K, V]]",
            func: Callable[[V, V], V],
            numPartitions: Optional[int] = None,
            partitionFunc: Callable[[K], int] = portable_hash,
        ) -> "RDD[Tuple[K, V]]":
    

    So indeed, this function expects your RDD to be of type Tuple[K, V], where K stands for the key and V for the value: the key is simply the first element of each tuple and the value is the second, regardless of any column names.
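
    This positional convention is easy to see with a minimal sketch (assuming an existing SparkContext named sc; pairs is just an illustrative name):

    # position 0 of each tuple is the key, position 1 the value
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    pairs.reduceByKey(lambda x, y: x + y).collect()
    # [('a', 4), ('b', 2)]  (key order may vary)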

    Now, if you want to perform reduceByKey where you have more columns, you can simply turn them into a single value column that is itself a tuple of values.

    An example would be the data from your example website, with an extra column added:

    data = [
        ("Project", 1, 2),
        ("Gutenberg’s", 1, 2),
        ("Alice’s", 1, 2),
        ("Adventures", 1, 2),
        ("in", 1, 2),
        ("Wonderland", 1, 2),
        ("Project", 1, 2),
        ("Gutenberg’s", 1, 2),
        ("Adventures", 1, 2),
        ("in", 1, 2),
        ("Wonderland", 1, 2),
        ("Project", 1, 2),
        ("Gutenberg’s", 1, 2),
    ]
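
    The snippet above does not show how rdd itself is created; presumably it is parallelized from this list, along the lines of (a sketch assuming a SparkContext named sc):

    rdd = sc.parallelize(data)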
    

    Let's turn rdd into an RDD with the correct shape (2 columns, a key and a value column):

    rdd2 = rdd.map(lambda x: (x[0], (x[1], x[2])))
    
    rdd2.collect()
    [('Project', (1, 2)), ('Gutenberg’s', (1, 2)), ('Alice’s', (1, 2)), ('Adventures', (1, 2)), ('in', (1, 2)), ('Wonderland', (1, 2)), ('Project', (1, 2)), ('Gutenberg’s', (1, 2)), ('Adventures', (1, 2)), ('in', (1, 2)), ('Wonderland', (1, 2)), ('Project', (1, 2)), ('Gutenberg’s', (1, 2))]
    

    And let's say we want to reduce the two value columns, but for the first we want a sum (as on your example site) and for the second a product:

    rdd3 = rdd2.reduceByKey(lambda a, b: (a[0] + b[0], a[1] * b[1]))
    
    rdd3.collect()
    [('Alice’s', (1, 2)), ('in', (2, 4)), ('Project', (3, 8)), ('Gutenberg’s', (3, 8)), ('Adventures', (2, 4)), ('Wonderland', (2, 4))]
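
    If you need flat three-column rows again afterwards, you can map the nested value tuple back out (rdd4 is just an illustrative name):

    rdd4 = rdd3.map(lambda kv: (kv[0], kv[1][0], kv[1][1]))

    rdd4.collect()
    [('Alice’s', 1, 2), ('in', 2, 4), ('Project', 3, 8), ('Gutenberg’s', 3, 8), ('Adventures', 2, 4), ('Wonderland', 2, 4)]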