
How to find the frequency of a word in a line, in a text file - Pyspark


I have managed to make an RDD (in Pyspark) that looks like this:

[('This', (0, 1)), ('is', (0, 1)), ('the', (0, 1)), ('100th', (0, 1))...]

I used the following code:

    RDD = sc.textFile(_filepath_)
    test1 = RDD.zipWithIndex().flatMap(lambda x: ((i, (x[1], 1)) for i in x[0].split(" ")))

Practically, the format is [(word, (line, freq))], so the words above come from the first line of the file (hence the 0), and freq is 1 for every word in the text. What I want is to count how many times each word appears on that specific line, across the entire RDD. I tried .reduceByKey(lambda x, y: x + y), but when I execute an action like .take(5) after that, it freezes (Ubuntu terminal, Oracle VirtualBox with plenty of RAM/disk space, if that helps). A likely reason is illustrated below.
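
One likely reason for the freeze: in test1 the key is the word alone and the values are tuples, and Python's + operator concatenates tuples rather than adding counts, so reducing by word builds one ever-growing tuple per word. A small plain-Python illustration (no Spark needed):

    >>> (0, 1) + (0, 1)   # what reduceByKey(lambda x, y: x + y) does to the tuple values
    (0, 1, 0, 1)          # grows by two elements per occurrence instead of becoming (0, 2)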

What I need, basically: if the word 'This' is in the first line and appears there 7 times, then the result should be [('This', (0, 7)), ...]


Solution

  • Solved it, but the answer may not be optimal.

    RDD = sc.textFile(_filepath_)
    # Emit (word, (line_number, 1)) for every word occurrence
    test1 = RDD.zipWithIndex().flatMap(lambda x: ((i, (x[1], 1)) for i in x[0].split(" ")))
    # Re-key by the composite (word, line_number) so the values are plain integers, then sum them
    test2 = test1.map(lambda x: ((x[0], x[1][0]), x[1][1])).reduceByKey(lambda x, y: x + y)
    # Reshape back to (word, (line_number, count))
    Result_RDD = test2.map(lambda x: (x[0][0], (x[0][1], x[1])))
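
A quick way to sanity-check the pipeline is to run it on a small in-memory RDD instead of a file. This is a minimal sketch: the sample lines are made up for illustration, and SparkContext.getOrCreate() is used so the snippet is self-contained:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    lines = sc.parallelize(["This is This is This", "the 100th line"])

    # Same three steps as the solution above
    test1 = lines.zipWithIndex().flatMap(lambda x: ((i, (x[1], 1)) for i in x[0].split(" ")))
    test2 = test1.map(lambda x: ((x[0], x[1][0]), x[1][1])).reduceByKey(lambda x, y: x + y)
    result = test2.map(lambda x: (x[0][0], (x[0][1], x[1])))

    print(result.collect())
    # Order may vary, but the counts are per (word, line):
    # [('This', (0, 3)), ('is', (0, 2)), ('the', (1, 1)), ('100th', (1, 1)), ('line', (1, 1))]

The key design choice is moving the line number out of the value and into a composite (word, line) key, which makes reduceByKey sum integers instead of concatenating tuples.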