Tags: java, scala, apache-spark, databricks, azure-databricks

How to set the timestamp format when reading CSV in Spark 2.4+


I have a Databricks 5.3 cluster on Azure which runs Apache Spark 2.4.0 and Scala 2.11.

I'm trying to parse a CSV file that uses a custom timestamp format, but I can't tell which datetime pattern syntax Spark expects.

My CSV looks like this:

Timestamp, Name, Value
02/07/2019 14:51:32.869-08:00, BatteryA, 0.25
02/07/2019 14:55:45.343-08:00, BatteryB, 0.50
02/07/2019 14:58:25.845-08:00, BatteryC, 0.34

I'm executing the following to read it:

%scala

val csvDataFrame = sqlContext.read.format("csv")
  .option("header", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "true")
  .option("mode","DROPMALFORMED")
  .option("timestampFormat", "MM/dd/yyyy HH:mm:ss.SSSZZ")
  .load("path/to/file.csv")

csvDataFrame.printSchema()

But no matter what timestamp pattern I use, the first column is always inferred as string.

csvDataFrame:org.apache.spark.sql.DataFrame
  Timestamp:string
  Name:string
  Value:double

I'm not a Java/Scala developer and I'm new to Spark/Databricks. I can't find documented anywhere which datetime formatter Spark uses to parse these values.

Patterns I tried:

MM/dd/yyyy HH:mm:ss.SSSZZ
MM/dd/yyyy HH:mm:ss.SSSXXX
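
A quick way to sanity-check a pattern outside of the Spark read (just a sketch: java.text.SimpleDateFormat is a stand-in here and not necessarily the exact formatter Spark 2.4 uses internally, but it accepts these pattern letters) is to parse one sample value from the file:

%scala

import java.text.SimpleDateFormat

// Try the pattern against a single value copied from the CSV.
// NOTE: this is only a stand-in check; Spark's internal CSV parser may differ.
val fmt = new SimpleDateFormat("MM/dd/yyyy HH:mm:ss.SSSXXX")
println(fmt.parse("02/07/2019 14:51:32.869-08:00"))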

Solution

  • It turns out that the Databricks docs mention a dateFormat option for reading/writing CSV files, but it has no effect here. The option that actually applies is the same one Spark itself uses: timestampFormat.

    My actual problem was caused by a malformed line in the file. Even with mode set to DROPMALFORMED, that line forced the Timestamp column to be inferred as string: the malformed-record handling appears to apply when rows are parsed against a schema, not during schema inference. Fixing that line resolved the issue.
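
    For reference, here is a minimal sketch of the read once the bad line is gone, using the placeholder path from above and the SSSXXX pattern from the question (whether that exact pattern matches your data is worth verifying). Supplying an explicit schema sidesteps inference entirely, so the column type no longer depends on every row parsing cleanly; the real check then becomes whether the timestamp values come through non-null.

    %scala

    import org.apache.spark.sql.types._

    // Explicit schema: the Timestamp column no longer depends on inference.
    val schema = StructType(Seq(
      StructField("Timestamp", TimestampType, nullable = true),
      StructField("Name", StringType, nullable = true),
      StructField("Value", DoubleType, nullable = true)
    ))

    val csvDataFrame = sqlContext.read.format("csv")
      .option("header", "true")
      .option("treatEmptyValuesAsNulls", "true")
      .option("mode", "DROPMALFORMED")
      .option("timestampFormat", "MM/dd/yyyy HH:mm:ss.SSSXXX") // pattern from the question
      .schema(schema)
      .load("path/to/file.csv")

    csvDataFrame.printSchema()     // Timestamp now reported as timestamp
    csvDataFrame.show(3, false)    // verify the values are parsed rather than null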