apache-spark foreach pyspark rdd

RDD foreach method provides no results


I am trying to understand how the foreach method works. In my Jupyter notebook, I tried:

def f(x): print(x)
a = sc.parallelize([1, 2, 3, 4, 5])
b = a.foreach(f)
print(type(b))
<class 'NoneType'>

The code runs without any problem, but I get no output except from the print(type(b)) line. foreach doesn't return anything, just None. I do not know what foreach is supposed to do or how to use it. Can you explain what it is used for?


Solution

  • foreach is an action: it runs a function on every element of the RDD for its side effects and does not return anything. In PySpark the call therefore evaluates to None, so you cannot use it as you do, i.e. assigning it to another variable like b = a.foreach(f). See Learning Spark, pp. 41-42.

    Adapting the simple example from the docs, run in a PySpark terminal:

    >>> def f(x): print(x)
    >>> a = sc.parallelize([1, 2, 3, 4, 5])
    >>> a.foreach(f)
    5
    4
    3
    1
    2
    

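    (NOTE: f runs on the executors, so whatever it prints goes to the executors' stdout rather than back to the driver. In a local PySpark terminal that output lands in the same console, in no guaranteed order. In a Databricks notebook the above code will not produce any print results, and the question suggests Jupyter behaves the same way.)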

    You may also find the answers in this thread helpful.