I would like to union multiple Structured Streaming DataFrames, each connected to a Kafka topic, so that I can apply a watermark to all of them at once.
For instance:
df1=socket_streamer(spark,topic1)
df2=socket_streamer(spark,topic2)
where spark is the SparkSession and socket_streamer reads the topic with spark.readStream.
Then I do:
Dataframe=df1.union(df2)
Dataframe=Dataframe.withWatermark("timestamp","5 minutes")
Then I try to write Dataframe out with writeStream.
The issue is that the union only shows rows from whichever DataFrame receives data first.
Do you have any idea how to get all of the data received by the union, or how I can apply the same watermark to multiple DataFrames?
Thank you!
Do df1 and df2 have the same structure? The union function in Spark resolves columns by position (not by name).
To union by name, use:
df1.unionByName(df2, allowMissingColumns=True)
(the allowMissingColumns parameter is available from Spark 3.1; unionByName itself has existed since Spark 2.3)
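To see why the position-versus-name distinction matters, here is a small plain-Python model of the two strategies. It is only a sketch, not Spark code: a "frame" is assumed to be a (columns, rows) pair, the helper names union_by_position and union_by_name are made up for illustration, and the sample rows are invented.

```python
# Plain-Python model of the two union strategies (illustrative only, not Spark code).
# A "frame" here is a (columns, rows) pair; both helper names are hypothetical.

def union_by_position(left, right):
    """Mimic DataFrame.union: keep the left column names, append right rows as-is."""
    left_cols, left_rows = left
    _, right_rows = right
    return left_cols, left_rows + right_rows


def union_by_name(left, right):
    """Mimic DataFrame.unionByName: reorder right rows to match left column names."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    if set(left_cols) != set(right_cols):
        raise ValueError("column names differ")
    # For each left column, find where that column sits in the right frame.
    index = [right_cols.index(name) for name in left_cols]
    reordered = [tuple(row[i] for i in index) for row in right_rows]
    return left_cols, left_rows + reordered


# Same columns, different order (invented sample data).
df1 = (["timestamp", "value"], [("2024-01-01 00:00:00", "a")])
df2 = (["value", "timestamp"], [("b", "2024-01-01 00:01:00")])

by_position = union_by_position(df1, df2)
by_name = union_by_name(df1, df2)

print(by_position[1][1])  # ('b', '2024-01-01 00:01:00'): "b" lands under "timestamp"
print(by_name[1][1])      # ('2024-01-01 00:01:00', 'b'): values line up with their names
```

If the two topics produce columns in a different order, the positional union puts values under the wrong names, so a watermark on "timestamp" would then be reading the wrong data for the second stream's rows.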