Search code examples
pythonapache-sparkpysparkazure-synapse

PySpark, pyspark.sql.DataFrame.foreachPartition example does not work


I am running an Apache Spark Cluster within Azure Synapse and I'm currently checkin for a way to perform the same operation for each partition. To understand the spark function foreachPartition I started to execute the example code.

However, the example:

df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])

def func(itr):
    for person in itr:
        print(person.name)

df.foreachPartition(func)

does not output anything: enter image description here

I thought it has something to do with the inner function and the print, so I tried to add it to a list and output the list at the end, but it's just empty.

enter image description here

Can someone explain why this is not working or where the actual problem is?

Spark Version is: 3.3.1.5.2-92314920


Solution

  • When running on a cluster stdout is the executors log's not the driver nodes. Most notebooks take a redirect from the driver node's stdout, but not the executors.

    You need to look at the executor logs from the Spark UI (no idea if Synapse has another ui for this) or instead of using foreachPartition, try using mapPartitions instead to return something to show on the driver node.