Tags: pyspark, text-files, word-count

How to compute the total number of words in a text file


I am given a text file (call it text.txt). I need to count the total number of words (counting repetitions as well). My code begins like this:

def words():
    f = sc.textFile("text.txt")
    return f.DO_SOME_MAGIC()

So my question reduces to: what should go in place of DO_SOME_MAGIC?

PS

For the following text file:

hello world
bye world

I should receive 4 and NOT:

(hello, 1)
(bye, 1)
(world, 2)

Solution

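  • Try this. It splits every line into words, flattens the result into a single RDD of words, and counts them (repetitions included):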

    def words():
        f = sc.textFile("text.txt")
        return f.flatMap(lambda line: line.split()).count()
    

    To count each word only once (without repetition), add distinct() before count():

    def words():
        f = sc.textFile("text.txt")
        return f.flatMap(lambda line: line.split()).distinct().count()
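
As a quick sanity check of what the two pipelines above compute, here is a minimal sketch of the same logic in plain Python, without Spark. The helper name count_words and the in-memory sample list are assumptions made for illustration only; they are not part of the original question or of the PySpark API.

```python
from itertools import chain


def count_words(lines):
    """Return (total, distinct) word counts for an iterable of text lines.

    Hypothetical helper that mirrors the Spark pipeline locally:
    flatMap(split) -> count() and flatMap(split) -> distinct() -> count().
    """
    # Flatten the per-line word lists into one list, like flatMap does.
    words = list(chain.from_iterable(line.split() for line in lines))
    # len(words) matches count(); len(set(words)) matches distinct().count().
    return len(words), len(set(words))


# The sample file from the question, as a list of lines.
sample = ["hello world", "bye world"]
total, distinct = count_words(sample)
print(total, distinct)  # 4 3
```

Note that split() with no arguments splits on any run of whitespace and returns an empty list for blank lines, so empty lines do not add to the count. Using split(" ") instead would produce empty strings for blank lines and repeated spaces, and those would be counted as words.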