Tags: python, pyspark, rdd, word-count

Count the frequency of each word in each line of a text - PySpark


I have created an RDD with this form in PySpark:

[(0, ('This', 1)), (0, ('is', 1)), (0, ('the', 1)), (0, ('100th', 1)), (0, ('Etext', 1)), (0, ('file', 1)), (0, ('presented', 1)), (0, ('by', 1)), (0, ('Project', 1)), (0, ('Gutenberg,', 1)), (0, ('and', 1)), (1, ('is', 1)), (1, ('presented', 1)), (1, ('in', 1)), (1, ('cooperation', 1)), (1, ('with', 1)), (1, ('World', 1)), (1, ('Library,', 1)), (1, ('Inc.,', 1)), (1, ('from', 1))]

For the first item, the word 'This' is located in the first line (line 0). I have appended a 1 to its right so that I can count its frequency.
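(For reference, a minimal sketch of how an RDD in this shape can be produced; the file name below is just a placeholder, not the actual input:)

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    lines = spark.sparkContext.textFile("input.txt")   # placeholder file name
    rdd = (lines.zipWithIndex()                        # (line_text, line_number)
                .flatMap(lambda li: [(li[1], (word, 1)) for word in li[0].split()]))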

I cannot find a way to solve this problem. The output I expect after using aggregateByKey or reduceByKey is, for example: in line zero the word 'This' was used 1 time, and so on:

[(0, ('This', 1, 'is', 1, 'the', 1, ...)), ...]

Solution

  • To find the number of occurrences of each word in every line and combine them together:

    1. Map the elements of the RDD so that the line number and the word together become the key, i.e. (0, ('This', 1)) becomes ((0, 'This'), 1)
    2. reduceByKey the RDD from step 1, summing the number of occurrences
    3. Remap the result of step 2 so that the line number becomes the key again
    4. reduceByKey once more to concatenate the (word, count) tuples for each line
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    rdd = spark.sparkContext.parallelize([(0, ('a', 1)), (0, ('b', 1)), (0, ('a', 1)), (1, ('a', 1))])

    # Steps 1 + 2: key by (line number, word), then sum the counts
    occurrences_per_line = rdd.map(lambda x: ((x[0], x[1][0]), x[1][1])).reduceByKey(lambda x, y: x + y)

    # Steps 3 + 4: key by line number again, then concatenate the (word, count) tuples
    occurrences_per_line.map(lambda x: (x[0][0], (x[0][1], x[1]))).reduceByKey(lambda x, y: x + y).collect()

    """
    [(0, ('a', 2, 'b', 1)), (1, ('a', 1))]
    """