I have the following code and result. I am using Databricks Auto Loader here.
The result I am getting is incorrect, because if I don't drop the columns (df2), I get the following result instead.
Note that I notice similar behavior with select. What mistake am I making here?
I have found the problem. I need to explicitly specify that the first line is a header, so I changed the relevant line to this:
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("header", "true")  # treat the first line of each file as column names
      .schema(schema)
      .load("/FileStore/tables/movies7"))
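For intuition, the same first-row ambiguity shows up outside Spark with Python's stdlib csv module: unless you declare a header, the first line is parsed as an ordinary data row and ends up mixed into your records. A minimal sketch (the column names and values here are hypothetical, not from my actual movies7 files):

```python
import csv
import io

# Hypothetical CSV content mirroring the movies7 layout: first line is a header.
raw = "title,year\nInception,2010\nHeat,1995\n"

# Without header handling, the header line comes back as a data row.
rows = list(csv.reader(io.StringIO(raw)))
print(rows[0])  # ['title', 'year'] -- header mixed in with the data

# Declaring the header (DictReader consumes the first line as field names)
# yields only the real records, keyed by column name.
records = list(csv.DictReader(io.StringIO(raw)))
print(records[0]["title"])  # Inception
```

This is exactly what `option("header", "true")` does for the CSV reader: without it, Spark treats the header line as data and falls back to the supplied schema's column names, which is why the values looked shifted in my output.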