Tags: apache-spark, pyspark, databricks, azure-databricks, spark-structured-streaming

Structured Streaming with Apache Spark coded in Spark SQL


Streaming transformations in Apache Spark on Databricks are usually coded in either Scala or Python. However, can someone let me know if it's also possible to code streaming in SQL on Delta?

For example, the following sample code uses PySpark for Structured Streaming; can you let me know what the equivalent would be in Spark SQL?

from pyspark.sql.functions import expr

# 'streaming' is an existing streaming DataFrame
simpleTransform = streaming.withColumn("stairs", expr("gt like '%stairs%'"))\
    .where("stairs")\
    .where("gt is not null")\
    .select("gt", "model", "arrival_time", "creation_time")\
    .writeStream\
    .queryName("simple_transform")\
    .format("memory")\
    .outputMode("update")\
    .start()

Solution

  • You can just register that streaming DataFrame as a temporary view and perform queries on it. For example (using the rate source just for simplicity):

    df = spark.readStream.format("rate").load()
    df.createOrReplaceTempView("my_stream")
    

    You can then run SQL queries directly on that view, such as select * from my_stream; in a Databricks notebook this displays the continuously updating streaming result.

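    Outside a notebook, a SQL SELECT alone doesn't start the streaming query, so you still need to attach a sink from PySpark. A minimal sketch of that route (the query name my_stream_query is just an illustrative choice):

    # spark.sql() on a streaming temp view returns a streaming DataFrame;
    # it still needs a sink and start() to actually run.
    result = spark.sql("select * from my_stream")
    query = (result.writeStream
        .queryName("my_stream_query")  # illustrative name
        .format("memory")
        .outputMode("append")
        .start())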

    Or you can create another view, applying whatever transformations you need. For example, we can select only every 5th value if we use this SQL statement:

    create or replace temp view my_derived as 
    select * from my_stream where (value % 5) == 0
    

    and then query that view with select * from my_derived, which works the same way.

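    Applying the same pattern to the question's example, the transformation chain moves into SQL, while registering the view and attaching the sink stay in PySpark (writeStream has no pure-SQL equivalent). A sketch, assuming the streaming DataFrame is called streaming as in the question, with activity_stream and stairs_transform as arbitrary view names:

    # Register the question's streaming DataFrame for SQL access
    streaming.createOrReplaceTempView("activity_stream")

    # The withColumn/where/select chain expressed as a SQL view
    # (named differently from the query to avoid clashing with the
    # in-memory table the memory sink registers under the query name)
    spark.sql("""
        create or replace temp view stairs_transform as
        select gt, model, arrival_time, creation_time
        from activity_stream
        where gt like '%stairs%' and gt is not null
    """)

    # The sink is still attached from PySpark
    (spark.table("stairs_transform")
        .writeStream
        .queryName("simple_transform")
        .format("memory")
        .outputMode("update")
        .start())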