I have a function "toDate(v: String): Timestamp" that takes a string and converts it into a timestamp with the format "MM-DD-YYYY HH24:MI:SS.NS".
I make a UDF of the function:
val u_to_date = sqlContext.udf.register("u_to_date", toDate _)
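The exact body of toDate isn't important here, but for context, a minimal sketch of what it could look like (the parsing logic below is just an illustration, assuming MM-DD-YYYY input; Timestamp.valueOf itself accepts up to nine fractional digits):

import java.sql.Timestamp

// Illustrative only: rearrange "MM-DD-YYYY HH:MM:SS.NS" into the
// "yyyy-mm-dd hh:mm:ss.fffffffff" form that Timestamp.valueOf expects.
def toDate(v: String): Timestamp = {
  val Array(datePart, timePart) = v.split(" ", 2)
  val Array(mm, dd, yyyy) = datePart.split("-")
  Timestamp.valueOf(s"$yyyy-$mm-$dd $timePart")
}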
The issue happens when the UDF is applied to a dataframe: the resulting dataframe loses the last 3 digits of nanosecond precision.
For example, when given the argument "0001-01-01 00:00:00.123456789", the resulting dataframe contains
[0001-01-01 00:00:00.123456]
I have even tried a dummy function that always returns Timestamp.valueOf("1234-01-01 00:00:00.123456789"). When applying the UDF of that dummy function, the last 3 digits are still truncated.
I have looked into the sqlContext conf, and spark.sql.parquet.int96AsTimestamp is set to true (I also tried with it set to false; the truncation happens either way).
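For reference, the setting can be read and toggled like this (getConf/setConf on SQLContext):

// Inspect the current value, then flip it; neither value changes the result.
println(sqlContext.getConf("spark.sql.parquet.int96AsTimestamp", "true"))
sqlContext.setConf("spark.sql.parquet.int96AsTimestamp", "false")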
I am at a loss here. What is causing the truncation of the last 3 digits?
Example
The function could be:
import java.sql.Timestamp

def date123(v: String): Timestamp = {
  Timestamp.valueOf("0001-01-01 00:00:00.123456789")
}
It's just a dummy function that should return a timestamp with full nanosecond precision.
Then I would make a UDF:
`val u_date123 = sqlContext.udf.register("u_date123", date123 _)`
Example df:

import org.apache.spark.sql.Row
import sqlContext.implicits._

case class X(x: String)

val theRow = Row("blah")
val theRdd = sc.makeRDD(Array(theRow))
val df = theRdd.map { case Row(s0) => X(s0.asInstanceOf[String]) }.toDF()
If I apply the UDF to the dataframe df with its string column, it returns a dataframe that looks like [0001-01-01 00:00:00.123456]:
df.select(u_date123($"x")).collect.foreach(println)
I think I found the issue.

On Spark 1.5.1, the size of the timestamp datatype was changed from 12 bytes to 8 bytes: a timestamp is now stored as a single Long of microseconds, so everything below microsecond precision is dropped. I tested on Spark 1.4.1, and it produces the full nanosecond precision.
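To see where the three digits go, here is a small sketch in plain JVM code (this mirrors the idea, not Spark's actual internals): java.sql.Timestamp carries nanoseconds in a separate int field, while a single 8-byte Long of microseconds can only keep the first six fractional digits.

import java.sql.Timestamp

val ts = Timestamp.valueOf("0001-01-01 00:00:00.123456789")

// Timestamp keeps a separate nanosecond field, so all nine digits are here:
println(ts.getNanos)        // 123456789

// Packing the fraction into microseconds (what one Long can hold)
// drops the last three digits via integer division:
println(ts.getNanos / 1000) // 123456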