Tags: scala, apache-spark, apache-spark-sql, sql-timestamp

How to get the full timestamp value from DataFrames? Values are being truncated


I have a function `toDate(v: String): Timestamp` that takes a string and converts it into a timestamp with the format "MM-DD-YYYY HH24:MI:SS.NS".
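
The original toDate isn't shown in the question; a minimal sketch of what such a function might look like (the parsing details here are an assumption, not the asker's code):

import java.sql.Timestamp
import java.text.SimpleDateFormat

// Hypothetical sketch: parse the date-time part, then re-attach the
// fractional digits via setNanos to keep full nanosecond precision.
// Assumes input shaped like "01-31-2015 23:59:59.123456789".
def toDate(v: String): Timestamp = {
  val Array(base, frac) = v.split("\\.", 2)
  val fmt = new SimpleDateFormat("MM-dd-yyyy HH:mm:ss")
  val ts = new Timestamp(fmt.parse(base).getTime)
  ts.setNanos(frac.padTo(9, '0').toInt) // pad to nine nanosecond digits
  ts
}

Note that java.sql.Timestamp itself carries full nanosecond precision, so the truncation described below happens inside Spark, not in the JVM type.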

I register a UDF from the function:

val u_to_date = sqlContext.udf.register("u_to_date", toDate _)

The issue happens when the UDF is applied to a DataFrame: the resulting DataFrame loses the last three digits of nanosecond precision. For example, given the argument "0001-01-01 00:00:00.123456789", the resulting DataFrame contains [0001-01-01 00:00:00.123456].

I have even tried a dummy function that returns `Timestamp.valueOf("1234-01-01 00:00:00.123456789")`. Applying a UDF made from that dummy function truncates the last three digits in the same way.

I have looked into the SQLContext configuration, and spark.sql.parquet.int96AsTimestamp is set to true (I also tried it set to false; it made no difference).
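
For reference, these are the standard SQLContext accessors for inspecting and flipping that setting, shown only to document what was checked:

sqlContext.getConf("spark.sql.parquet.int96AsTimestamp")          // "true"
sqlContext.setConf("spark.sql.parquet.int96AsTimestamp", "false") // no change in behavior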

I am at a loss here. What is causing the truncation of the last three digits?


Example

The function could be:

def date123(v: String): Timestamp = Timestamp.valueOf("0001-01-01 00:00:00.123456789")

It's just a dummy function that should return a timestamp with full nanosecond precision.

Then I would make a udf:

val u_date123 = sqlContext.udf.register("u_date123", date123 _)

Example DataFrame:

import org.apache.spark.sql.Row
import sqlContext.implicits._

val theRow = Row("blah")
val theRdd = sc.makeRDD(Array(theRow))
case class X(x: String)
val df = theRdd.map { case Row(s0) => X(s0.asInstanceOf[String]) }.toDF()

If I apply the UDF to the DataFrame's string column, it returns a DataFrame that looks like '[0001-01-01 00:00:00.123456]':

df.select(u_date123($"x")).collect.foreach(println)
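// prints: [0001-01-01 00:00:00.123456] — nanoseconds truncated to microseconds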


Solution

  • I think I found the issue.

    Between Spark 1.4.1 and 1.5.0, the size of the timestamp datatype was changed from 12 bytes to 8 bytes:

    https://fossies.org/diffs/spark/1.4.1_vs_1.5.0/sql/catalyst/src/main/scala/org/apache/spark/sql/types/TimestampType.scala-diff.html

    I tested on Spark 1.4.1, and it produces the full nanosecond precision. (A workaround for 1.5+ is sketched below.)
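
    Per the linked diff, TimestampType's default size dropped from 12 bytes (Spark 1.4.1) to 8 bytes (Spark 1.5.0); an 8-byte long holding microseconds cannot carry the last three nanosecond digits, which matches the observed truncation.

    If nanosecond precision must be preserved on Spark 1.5+, one workaround is to keep the value out of TimestampType altogether, e.g. by returning a string from the UDF. A minimal sketch (u_date123_str is a made-up name, not from the original post):

    import java.sql.Timestamp

    // Timestamp.toString keeps all nine fractional digits, and StringType
    // sidesteps the 8-byte internal timestamp representation.
    val u_date123_str = sqlContext.udf.register(
      "u_date123_str",
      (v: String) => Timestamp.valueOf("0001-01-01 00:00:00.123456789").toString
    )

    df.select(u_date123_str($"x")).collect.foreach(println)
    // expected: [0001-01-01 00:00:00.123456789]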