I am trying to replicate the given code to see how foreach
works, I tried with the following code:
rdd = sc.parallelize([1,2,3,4,5])
def f(a):
print(a)
rdd.collect().foreach(f)
But it gives the following error:
AttributeError: 'list' object has no attribute 'foreach'
I understand the error that return type of collect()
is a array
(which is list) and it doesn't have foreach
attribute associated with it but, I don't understand how this doesn't work if it's given in the official spark 3.0.1
documentation. What am I missing. I am using Spark 3.0.1
rdd.collect()
is a Spark action that will collect the data to the driver. The collect
method returns a list, therefore if you want to print out the list's elements just do:
rdd = sc.parallelize([1,2,3,4,5])
def f(a):
print(a)
res = rdd.collect()
[f(e) for e in res]
# output
# 1
# 2
# 3
# 4
# 5
An alternative would be to use another action foreach
as mentioned in your example, for instance rdd.foreach(f)
. Although, because of the distributed nature of Spark, there is no guarantee that the print
command will send the data to the driver's output stream. This essentially means that you could never see this data printed out.
Why is that? Because the Spark has two deployment modes, cluster
and client
mode. In cluster mode, the driver run in an arbitrary node in the cluster. In client mode on the other side, the driver runs in the machine from which you submitted the Spark job, consequently, Spark driver and your program share the same output stream. Hence you should always be able to see the output of the program.
Related links