python apache-spark pyspark azure-synapse

PySpark, pyspark.sql.DataFrame.foreachPartition example does not work

I am running an Apache Spark cluster within Azure Synapse and I'm currently looking for a way to perform the same operation on each partition. To understand the Spark function foreachPartition, I started by running the example code from the documentation.

However, the example:

df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])

def func(itr):
    for person in itr:
        print(person.name)

df.foreachPartition(func)

does not output anything.

I thought it had something to do with the inner function and the print call, so I tried appending the names to a list and printing the list at the end, but the list is just empty.

Can someone explain why this is not working or where the actual problem is?

Spark Version is: 3.3.1.5.2-92314920

Solution

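When running on a cluster, the function passed to foreachPartition runs on the executors, so its stdout goes to the executors' logs, not the driver node's. Most notebooks capture and display the driver node's stdout, but not the executors'. The list you tried has the same problem: the function is serialized and shipped to the executors, so each one appends to its own copy and the list on the driver stays empty.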

You need to look at the executor logs in the Spark UI (I don't know whether Synapse has a separate UI for this). Alternatively, instead of foreachPartition, use mapPartitions (available through df.rdd in PySpark) so that something is returned to the driver node, where it can be displayed.
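
A minimal sketch of the mapPartitions approach, reusing the df from the question. The helper names names_in_partition and collect_names are made up for illustration, and the namedtuple rows at the bottom are only a local stand-in so the per-partition function can be checked without a cluster:

```python
from collections import namedtuple


def names_in_partition(rows):
    """Runs on the executors, once per partition; yields values instead of printing."""
    for row in rows:
        yield row.name


def collect_names(df):
    """Apply the function per partition and bring the results back to the driver."""
    # mapPartitions lives on the underlying RDD, hence df.rdd.
    # collect() moves every result to the driver, so keep it to small outputs.
    return df.rdd.mapPartitions(names_in_partition).collect()


# Local check without Spark: plain namedtuples stand in for Row objects.
Person = namedtuple("Person", ["age", "name"])
sample = [Person(14, "Tom"), Person(23, "Alice"), Person(16, "Bob")]
local_names = list(names_in_partition(iter(sample)))
print(local_names)
```

In the notebook, print(collect_names(df)) should then show the names in the cell output, because the print happens on the driver. If the per-partition work is purely a side effect (for example writing to a database), foreachPartition is still the right call, and the executor logs are the place to look for anything it prints.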