Tags: python, apache-spark, pyspark, rdd

How do I count the number of occurrences in a spark RDD and return it as a dictionary?


I am loading a CSV file as a DataFrame and converting it to an RDD. The RDD contains a list of cities like [NY, NY, NY, LA, LA, LA, Detroit, Miami]. I want to be able to extract the frequency of each city, like so:

NY: 3 LA: 3 Detroit: 1 Miami: 1

I know I can do this with DataFrame functions, but I need to do it specifically with RDD functions like map, filter, etc.

Here's what I tried so far:

        df = spark.read.format("csv").option("header", "true").load(filename)
        newRDD = df.rdd.map(lambda x: x[6]).filter(lambda x: x is not None)

In the code above I am just extracting the column at index 6 of my DataFrame, which contains the cities.


Solution

  • You can try `reduceByKey`: map each city to a `(city, 1)` pair, then sum the counts per key.

    >>> from pyspark.sql.types import StringType
    >>> df = spark.createDataFrame(["NY", "NY", "NY", "LA", "LA", "LA", "Detroit", "Miami"], StringType())
    >>> rdd2 = df.rdd.map(lambda x: (x[0], 1)).reduceByKey(lambda x, y: x + y)
    >>> rdd2.collect()
    [('Detroit', 1), ('NY', 3), ('LA', 3), ('Miami', 1)]
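    Since the question asks for a dictionary rather than a list of tuples, here is a minimal sketch (assuming a local SparkSession; the sample city list stands in for the CSV column) of two ways to finish the job: `countByValue()` tallies the elements directly, and `collectAsMap()` turns the `reduceByKey` pair RDD into a plain dict.

    ```python
    from pyspark.sql import SparkSession

    # Local session for illustration; in the original code `spark` already exists.
    spark = SparkSession.builder.master("local[1]").appName("city-counts").getOrCreate()

    cities = ["NY", "NY", "NY", "LA", "LA", "LA", "Detroit", "Miami"]
    rdd = spark.sparkContext.parallelize(cities)

    # Option 1: countByValue() returns a dict-like mapping of element -> count.
    counts = dict(rdd.countByValue())

    # Option 2: reduceByKey as in the answer, then collectAsMap() to get a dict.
    counts2 = rdd.map(lambda c: (c, 1)).reduceByKey(lambda a, b: a + b).collectAsMap()

    spark.stop()
    ```

    Both return the frequencies as a dictionary, e.g. `{'NY': 3, 'LA': 3, 'Detroit': 1, 'Miami': 1}`; `countByValue()` is the more direct fit when you only need counts.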