I'm a newbie to PySpark and going through this.
How does reduceByKey() know whether it should consider the first column as key and the second as value, or vice versa? I don't see any column name or index mentioned in the call to reduceByKey(). Does reduceByKey() by default consider the first column as key and the second as value?
How do I perform reduceByKey() if there are multiple columns in the dataframe? I'm aware of df.select(col1, col2).reduceByKey(); I'm just looking to see if there is any other way.
I'm not sure which version of Spark you are on, but it should not matter much here. I'll assume you are on version 3.4.1, which is the latest version as of the time of writing.
If we take a look at the function signature of reduceByKey in the source code, we see this:
def reduceByKey(
    self: "RDD[Tuple[K, V]]",
    func: Callable[[V, V], V],
    numPartitions: Optional[int] = None,
    partitionFunc: Callable[[K], int] = portable_hash,
) -> "RDD[Tuple[K, V]]":
So indeed, this function expects the elements of your RDD to be of type Tuple[K, V], where K stands for key and V stands for value. That answers your first question: there is no column name or index to specify, because the first element of each tuple is always the key and the second is always the value.
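As a quick illustration (a minimal sketch of my own, assuming a SparkSession named spark is available), a plain word-count-style RDD already has this shape, so reduceByKey needs nothing more:
pairs = spark.sparkContext.parallelize([("a", 1), ("b", 1), ("a", 1)])
pairs.reduceByKey(lambda x, y: x + y).collect()
# e.g. [('a', 2), ('b', 1)] -- the first tuple element is always the key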
Now, if you want to perform reduceByKey when you have more columns, you can simply combine them into a single value column that is itself a tuple of values.
An example would be the data from your example website, with an extra column added:
data = [
    ("Project", 1, 2),
    ("Gutenberg’s", 1, 2),
    ("Alice’s", 1, 2),
    ("Adventures", 1, 2),
    ("in", 1, 2),
    ("Wonderland", 1, 2),
    ("Project", 1, 2),
    ("Gutenberg’s", 1, 2),
    ("Adventures", 1, 2),
    ("in", 1, 2),
    ("Wonderland", 1, 2),
    ("Project", 1, 2),
    ("Gutenberg’s", 1, 2),
]
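For completeness, here's how that list becomes the rdd used below (a minimal sketch, again assuming a SparkSession named spark):
rdd = spark.sparkContext.parallelize(data)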
Let's turn rdd into an RDD with the correct shape (two columns: a key and a single value):
rdd2 = rdd.map(lambda x: (x[0], (x[1], x[2])))
rdd2.collect()
[('Project', (1, 2)), ('Gutenberg’s', (1, 2)), ('Alice’s', (1, 2)), ('Adventures', (1, 2)), ('in', (1, 2)), ('Wonderland', (1, 2)), ('Project', (1, 2)), ('Gutenberg’s', (1, 2)), ('Adventures', (1, 2)), ('in', (1, 2)), ('Wonderland', (1, 2)), ('Project', (1, 2)), ('Gutenberg’s', (1, 2))]
And let's say we want to reduce both value columns, but for the first one we want a sum (as on your example site) and for the second one a product:
rdd3 = rdd2.reduceByKey(lambda a, b: (a[0] + b[0], a[1] * b[1]))
rdd3.collect()
[('Alice’s', (1, 2)), ('in', (2, 4)), ('Project', (3, 8)), ('Gutenberg’s', (3, 8)), ('Adventures', (2, 4)), ('Wonderland', (2, 4))]
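Finally, since you mentioned a dataframe: the same approach works if you start from a DataFrame and drop down to its underlying RDD of Row objects via df.rdd. A sketch under assumed column names (word, n1, n2 are hypothetical, not from your example):
df = spark.createDataFrame(data, ["word", "n1", "n2"])
rdd4 = df.rdd.map(lambda row: (row["word"], (row["n1"], row["n2"])))
rdd4.reduceByKey(lambda a, b: (a[0] + b[0], a[1] * b[1])).collect()
This avoids the df.select(col1, col2) pattern you mentioned, but note that for DataFrames a groupBy with aggregations is usually the more idiomatic route.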