Search code examples
apache-sparkrdd

'list' object has no attribute 'foreach'


I am trying to replicate the given code to see how foreach works, I tried with the following code:

rdd = sc.parallelize([1,2,3,4,5])

def f(a):
    print(a)

rdd.collect().foreach(f)

But it gives the following error:

AttributeError: 'list' object has no attribute 'foreach'

I understand the error that return type of collect() is a array(which is list) and it doesn't have foreach attribute associated with it but, I don't understand how this doesn't work if it's given in the official spark 3.0.1 documentation. What am I missing. I am using Spark 3.0.1


Solution

  • rdd.collect() is a Spark action that will collect the data to the driver. The collect method returns a list, therefore if you want to print out the list's elements just do:

    rdd = sc.parallelize([1,2,3,4,5])
    
    def f(a):
        print(a)
    
    res = rdd.collect()
    
    [f(e) for e in res]
    
    # output
    # 1
    # 2
    # 3
    # 4
    # 5
    
    

    An alternative would be to use another action foreach as mentioned in your example, for instance rdd.foreach(f). Although, because of the distributed nature of Spark, there is no guarantee that the print command will send the data to the driver's output stream. This essentially means that you could never see this data printed out.

    Why is that? Because the Spark has two deployment modes, cluster and client mode. In cluster mode, the driver run in an arbitrary node in the cluster. In client mode on the other side, the driver runs in the machine from which you submitted the Spark job, consequently, Spark driver and your program share the same output stream. Hence you should always be able to see the output of the program.

    Related links