apache-spark dataframe apache-spark-sql rdd

How to sort a column with Date and time values in Spark?


Note: I have this as a DataFrame in Spark. These date/time values make up a single column of the DataFrame.

Input:

04-NOV-16 03.36.13.000000000 PM
06-NOV-15 03.42.21.000000000 PM
05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM

Expected Output:

05-NOV-15 03.32.05.000000000 PM
06-NOV-15 03.32.14.000000000 AM
06-NOV-15 03.42.21.000000000 PM
04-NOV-16 03.36.13.000000000 PM

Solution

  • As this format is not standard, you need to use the unix_timestamp function to parse the string and convert it into a timestamp type:

    import org.apache.spark.sql.functions._
    // toDF below needs `import spark.implicits._` (already in scope in spark-shell)
    
    // Example data
    val df = Seq(
      Tuple1("04-NOV-16 03.36.13.000000000 PM"),
      Tuple1("06-NOV-15 03.42.21.000000000 PM"),
      Tuple1("05-NOV-15 03.32.05.000000000 PM"),
      Tuple1("06-NOV-15 03.32.14.000000000 AM")
    ).toDF("stringCol")
    
    // Pattern for the strings above: hh = 12-hour clock, S = milliseconds, a = AM/PM marker
    val pattern = "dd-MMM-yy hh.mm.ss.S a"
    
    // Creating new DataFrame and ordering
    val newDF = df
      .withColumn("timestampCol", unix_timestamp(df("stringCol"), pattern).cast("timestamp"))
      .orderBy("timestampCol")
    
    newDF.show(false)
    

    Result:

    +-------------------------------+---------------------+
    |stringCol                      |timestampCol         |
    +-------------------------------+---------------------+
    |05-NOV-15 03.32.05.000000000 PM|2015-11-05 15:32:05.0|
    |06-NOV-15 03.32.14.000000000 AM|2015-11-06 03:32:14.0|
    |06-NOV-15 03.42.21.000000000 PM|2015-11-06 15:42:21.0|
    |04-NOV-16 03.36.13.000000000 PM|2016-11-04 15:36:13.0|
    +-------------------------------+---------------------+
    

    More about unix_timestamp and other utility functions can be found in the Scaladoc for org.apache.spark.sql.functions.

    For building the timestamp pattern, refer to the java.text.SimpleDateFormat docs.


    Edit 1: As pheeleeppoo pointed out, you can order directly by the expression instead of creating a new column, if you want to keep only the string-typed column in your DataFrame:

    val newDF = df.orderBy(unix_timestamp(df("stringCol"), pattern).cast("timestamp"))
    

    Edit 2: Please note that the unix_timestamp function has a precision of seconds, so if the milliseconds really matter, a UDF can be used instead:

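    import java.text.SimpleDateFormat
    import org.apache.spark.sql.functions.udf
    
    // Builds a UDF that parses strings with the given SimpleDateFormat pattern
    def myUDF(p: String) = udf(
      (value: String) => {
        // SimpleDateFormat is not thread-safe, so create one per call
        val dateFormat = new SimpleDateFormat(p)
        val parsedDate = dateFormat.parse(value)
        new java.sql.Timestamp(parsedDate.getTime())
      }
    )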
    
    val pattern = "dd-MMM-yy hh.mm.ss.S a"
    val newDF = df.withColumn("timestampCol", myUDF(pattern)(df("stringCol"))).orderBy("timestampCol")
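
One caveat with the UDF above: in SimpleDateFormat, S means milliseconds, so a nine-digit fraction such as 123000000 is read as that many milliseconds rather than as nanoseconds. It works for the sample data only because the fraction is all zeros. As a rough sketch (not part of the original answer), java.time can read the nine digits as a true fraction of a second. The object name TimestampSort and the helper parse are made up for illustration, and the snippet sorts a plain Scala Seq rather than a DataFrame. Inside Spark, the same parse could presumably be wrapped in a udf returning java.sql.Timestamp.valueOf(parse(value)).

```scala
import java.time.LocalDateTime
import java.time.format.{DateTimeFormatter, DateTimeFormatterBuilder}
import java.util.Locale

// Hypothetical helper object, for illustration only
object TimestampSort {
  // Case-insensitive so that "NOV" and "PM" match the English names;
  // nine S letters read the fraction as nanoseconds.
  val formatter: DateTimeFormatter = new DateTimeFormatterBuilder()
    .parseCaseInsensitive()
    .appendPattern("dd-MMM-yy hh.mm.ss.SSSSSSSSS a")
    .toFormatter(Locale.ENGLISH)

  def parse(value: String): LocalDateTime = LocalDateTime.parse(value, formatter)

  // Explicit ordering, so the sort does not depend on which implicit
  // Ordering the Scala version in use happens to provide
  val byTime: Ordering[LocalDateTime] =
    Ordering.fromLessThan[LocalDateTime](_ isBefore _)

  // Sample strings from the question
  val input: Seq[String] = Seq(
    "04-NOV-16 03.36.13.000000000 PM",
    "06-NOV-15 03.42.21.000000000 PM",
    "05-NOV-15 03.32.05.000000000 PM",
    "06-NOV-15 03.32.14.000000000 AM"
  )

  // Sort the raw strings by their parsed date-time value
  val sorted: Seq[String] = input.sortBy(parse)(byTime)
}
```

TimestampSort.sorted lists the strings in the same order as the Expected Output in the question.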