Search code examples
pysparkapache-kafka-streamsspark-structured-streaming

How can i do "union" on multiple spark structured streaming dataframes?


I would like to make a union operation on multiple structured streaming dataframe, connected to kafka topics, in order to watermark them all at the same moment.

For instance:

df1=socket_streamer(spark,topic1)
df2=socket_streamer(spark,topic2)

where spark=sparksession and socket_streamer = spark.readstream

then i'll do:

Dataframe=df1.union(df2)
Dataframe=Dataframe.withWatermark("timestamp","5 minutes")

then I try to writeStream Dataframe.

The issue is: the union displays only the first df to receive rows.

Do you have any idea, to get all my data received by the union or how can I apply a same watermark on multiple dataframes ?

Tank you !


Solution

  • Does df1 and df2 have the same structure? Union function in spark resolves columns by position (not by name).

    To union by name, use:

    df1.unionByName(df2, allowMissingColumns=True)
    

    (available from Spark 3.1.X)