apache-spark pyspark databricks spark-structured-streaming

Why is dropping or selecting columns not working properly with Spark Structured Streaming?


I have the following code and result. I am using Databricks Auto Loader.

[screenshot: the code and its result]

The result I am getting is not correct: if I don't drop the columns (df2), I get the following result instead.

[screenshot: the result without dropping the columns (df2)]

I notice similar behavior with select. What mistake am I making here?


Solution

  • I found the problem: I need to explicitly specify that the first line is a header. So I changed the relevant line to this:

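    # `schema` is the schema defined in the question's code (not shown here).
    # "header" = "true" makes the CSV reader treat the first line as column names, not data.
    df = (spark.readStream.format("cloudFiles").option("cloudFiles.format", "csv")
          .option("header", "true").schema(schema).load("/FileStore/tables/movies7"))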
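To see why the header option matters, here is a rough illustration in plain Python. It is not Spark code and does not reproduce the screenshots above. It only imitates what a CSV reader does when it is given a fixed schema and is not told that the first line is a header: the header line is parsed as an ordinary data row, and values that do not fit the declared type come out as nulls (as in Spark's default permissive mode). The sample data, the two-column schema, and the helper names `to_int` and `read_rows` are all made up for this sketch.

```python
import csv
import io

# Hypothetical sample data; the files from the original post are not shown.
SAMPLE = "title,year\nAlien,1979\nHeat,1995\n"


def to_int(value):
    """Return an int, or None when the text is not a number."""
    try:
        return int(value)
    except ValueError:
        return None


def read_rows(text, header):
    """Parse CSV text against a fixed (title: str, year: int) schema."""
    reader = csv.reader(io.StringIO(text))
    if header:
        next(reader)  # skip the first line instead of treating it as data
    return [(title, to_int(year)) for title, year in reader]


rows_without_header = read_rows(SAMPLE, header=False)
rows_with_header = read_rows(SAMPLE, header=True)

print(rows_without_header)  # the header line shows up as an extra, partly null row
print(rows_with_header)     # only the real data rows
```

Without the header flag, the sketch returns one extra row built from the header text, which is the kind of stray row that the `.option("header", "true")` setting removes in the real reader.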