I have created an RDD of this form in PySpark:
[(0, ('This', 1)), (0, ('is', 1)), (0, ('the', 1)), (0, ('100th', 1)), (0, ('Etext', 1)), (0, ('file', 1)), (0, ('presented', 1)), (0, ('by', 1)), (0, ('Project', 1)), (0, ('Gutenberg,', 1)), (0, ('and', 1)), (1, ('is', 1)), (1, ('presented', 1)), (1, ('in', 1)), (1, ('cooperation', 1)), (1, ('with', 1)), (1, ('World', 1)), (1, ('Library,', 1)), (1, ('Inc.,', 1)), (1, ('from', 1))]
For the first item, the word 'This' is located in the first row (0), and I have added a 1 to its right to hold the frequency.
I cannot find a way to solve this problem. The output I am expecting after using aggregateByKey or reduceByKey is, for example, the following, meaning that in line zero the word 'This' was used 1 time, and so on:
[(0, ('This', 1, 'is', 1, 'the', 1, ...)), ...]
To find the number of occurrences of each word in every line and combine them together:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([(0, ('a', 1)), (0, ('b', 1)), (0, ('a', 1)), (1, ('a', 1))])
# Sum the counts per (line, word) pair, then regroup by line, concatenating the (word, count) tuples
occurrences_per_line = rdd.map(lambda x: ((x[0], x[1][0]), x[1][1])).reduceByKey(lambda x, y: x + y)
occurrences_per_line.map(lambda x: (x[0][0], (x[0][1], x[1]))).reduceByKey(lambda x, y: x + y).collect()
"""
[(0, ('a', 2, 'b', 1)), (1, ('a', 1))]
"""