Search code examples
apache-sparkapache-spark-sqltimestampmillisecondsformat-string

spark sql string to timestamp missing milliseconds


Why is:

import spark.implicits._
  val content = Seq(("2019", "09", "11","17","16","54","762000000")).toDF("year", "month", "day", "hour", "minute", "second", "nano")
  content.printSchema
  content.show
  content.withColumn("event_time_utc", to_timestamp(concat('year, 'month, 'day, 'hour, 'minute, 'second), "yyyyMMddHHmmss"))
    .withColumn("event_time_utc_millis", to_timestamp(concat('year, 'month, 'day, 'hour, 'minute, 'second, substring('nano, 0, 3)), "yyyyMMddHHmmssSSS"))
    .select('year, 'month, 'day, 'hour, 'minute, 'second, 'nano,substring('nano, 0, 3), 'event_time_utc, 'event_time_utc_millis)
    .show

missing the milliseconds?

+----+-----+---+----+------+------+---------+---------------------+-------------------+---------------------+
|year|month|day|hour|minute|second|     nano|substring(nano, 0, 3)|     event_time_utc|event_time_utc_millis|
+----+-----+---+----+------+------+---------+---------------------+-------------------+---------------------+
|2019|   09| 11|  17|    16|    54|762000000|                  762|2019-09-11 17:16:54|  2019-09-11 17:16:54|
+----+-----+---+----+------+------+---------+---------------------+-------------------+---------------------+

for a format string of: yyyyMMddHHmmssSSS which should include the milliseconds in SSS if I am not mistaken.


Solution

  • I have faced similar problem, Official Document says below line till spark <2.4:

    Convert time string to a Unix timestamp (in seconds) with a specified format (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix timestamp (in seconds), return null if fail.

    Which means that it handles seconds only.

    Spark>= 2.4 can handle SSS as well.

    Solution: Below UDF will help to handle this scenario:

    import java.text.SimpleDateFormat
    import java.sql.Timestamp
    import org.apache.spark.sql.functions._
    import scala.util.{Try, Success, Failure}
    
    val getTimestampWithMilis: ((String , String) => Option[Timestamp]) = (input, frmt) => input match {
      case "" => None
      case _ => {
        val format = new SimpleDateFormat(frmt)
        Try(new Timestamp(format.parse(input).getTime)) match {
          case Success(t) => Some(t)
          case Failure(_) => None
        }    
      }
    }
    
    val getTimestampWithMilisUDF = udf(getTimestampWithMilis)
    
    

    For your Example:

    val content = Seq(("2019", "09", "11","17","16","54","762000000")).toDF("year", "month", "day", "hour", "minute", "second", "nano")
    val df = content.withColumn("event_time_utc", concat('year, 'month, 'day, 'hour, 'minute, 'second, substring('nano, 0, 3)))
    df.show
    +----+-----+---+----+------+------+---------+-----------------+
    |year|month|day|hour|minute|second|     nano|   event_time_utc|
    +----+-----+---+----+------+------+---------+-----------------+
    |2019|   09| 11|  17|    16|    54|762000000|20190911171654762|
    +----+-----+---+----+------+------+---------+-----------------+
    
    df.withColumn("event_time_utc_millis", getTimestampWithMilisUDF($"event_time_utc", lit("yyyyMMddHHmmssSSS"))).show(1, false)
    +----+-----+---+----+------+------+---------+-----------------+-----------------------+
    |year|month|day|hour|minute|second|nano     |event_time_utc   |event_time_utc_millis  |
    +----+-----+---+----+------+------+---------+-----------------+-----------------------+
    |2019|09   |11 |17  |16    |54    |762000000|20190911171654762|2019-09-11 17:16:54.762|
    +----+-----+---+----+------+------+---------+-----------------+-----------------------+
    
    root
     |-- year: string (nullable = true)
     |-- month: string (nullable = true)
     |-- day: string (nullable = true)
     |-- hour: string (nullable = true)
     |-- minute: string (nullable = true)
     |-- second: string (nullable = true)
     |-- nano: string (nullable = true)
     |-- event_time_utc: string (nullable = true)
     |-- event_time_utc_millis: timestamp (nullable = true)