I have a Databricks 5.3 cluster on Azure which runs Apache Spark 2.4.0 and Scala 2.11.
I'm trying to parse a CSV file with a custom timestamp format, but I don't know which datetime pattern syntax Spark expects.
My CSV looks like this:
Timestamp, Name, Value
02/07/2019 14:51:32.869-08:00, BatteryA, 0.25
02/07/2019 14:55:45.343-08:00, BatteryB, 0.50
02/07/2019 14:58:25.845-08:00, BatteryC, 0.34
I'm executing the following to read it:
%scala
val csvDataFrame = sqlContext.read.format("csv")
  .option("header", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "true")
  .option("mode", "DROPMALFORMED")
  .option("timestampFormat", "MM/dd/yyyy HH:mm:ss.SSSZZ")
  .load("path/to/file.csv")
csvDataFrame.printSchema()
But no matter which timestamp pattern I use, the first column is always inferred as a string.
csvDataFrame:org.apache.spark.sql.DataFrame
Timestamp:string
Name:string
Value:double
I'm not a Java/Scala developer and I'm new to Spark/Databricks. I can't find documented anywhere which datetime formatter Spark uses to parse the values.
Patterns I tried:
MM/dd/yyyy HH:mm:ss.SSSZZ
MM/dd/yyyy HH:mm:ss.SSSXXX
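For what it's worth, Spark 2.4's CSV reader parses timestampFormat with SimpleDateFormat-style patterns (via Commons Lang's FastDateFormat), so a pattern can be sanity-checked against a sample value in plain Scala before involving Spark at all. Here XXX matches an ISO-8601 offset like -08:00:

```scala
import java.text.SimpleDateFormat

// SimpleDateFormat-style pattern; XXX matches an ISO-8601 offset such as -08:00
val fmt = new SimpleDateFormat("MM/dd/yyyy HH:mm:ss.SSSXXX")
val parsed = fmt.parse("02/07/2019 14:51:32.869-08:00")
println(s"epoch millis: ${parsed.getTime}")
```

If the pattern were wrong for the sample value, parse would throw a ParseException, which is a much faster feedback loop than re-running the cluster job.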
It turns out that the Databricks docs mention a dateFormat option when reading/writing CSV files, but setting it had no effect here. The correct option is the one Spark itself uses, timestampFormat.
My problem was caused by a malformed line in my file. Even with mode set to DROPMALFORMED, that line made the Timestamp column fall back to string: malformed rows are only dropped when the data is read, not during schema inference, so a single unparseable value forces the inferred type to string.
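To locate such a row without Spark, each line's first field can be parsed with the same SimpleDateFormat-style pattern the CSV reader uses, flagging the failures. A minimal sketch with hypothetical sample lines (the second one deliberately malformed):

```scala
import java.text.SimpleDateFormat
import scala.util.Try

val fmt = new SimpleDateFormat("MM/dd/yyyy HH:mm:ss.SSSXXX")
fmt.setLenient(false)

// Hypothetical data lines (header already stripped); the second is malformed
val lines = Seq(
  "02/07/2019 14:51:32.869-08:00, BatteryA, 0.25",
  "not-a-timestamp, BatteryB, 0.50"
)

// 1-based numbers of lines whose first field fails to parse as a timestamp
val badLines = lines.zipWithIndex.collect {
  case (line, i) if Try(fmt.parse(line.split(",")(0).trim)).isFailure => i + 1
}
println(s"Malformed lines: ${badLines.mkString(", ")}")
```

Once the offending lines are known, they can be fixed or removed; alternatively, supplying an explicit schema to the reader sidesteps inference entirely, so one bad row no longer downgrades the whole column to string.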