scala, dataframe, unix-timestamp

Unexpected incorrect result after unix_timestamp conversion in Spark SQL


I have a dataframe with content like below:

scala> patDF.show
+---------+-------+-----------+-------------+
|patientID|   name|dateOtBirth|lastVisitDate|
+---------+-------+-----------+-------------+
|     1001|Ah Teck| 1991-12-31|   2012-01-20|
|     1002|  Kumar| 2011-10-29|   2012-09-20|
|     1003|    Ali| 2011-01-30|   2012-10-21|
+---------+-------+-----------+-------------+

All of the columns are strings.

I want to get the list of records whose lastVisitDate falls between a given date (in yyyy-mm-dd format) and now, so here is the script:

patDF.registerTempTable("patients") 
val results2 = sqlContext.sql("SELECT * FROM patients WHERE from_unixtime(unix_timestamp(lastVisitDate, 'yyyy-mm-dd')) between '2012-09-15' and current_timestamp() order by lastVisitDate")
results2.show() 

It returns nothing, even though there should presumably be records for patientID 1002 and 1003.

So I modified the query to:

val results3 = sqlContext.sql("SELECT from_unixtime(unix_timestamp(lastVisitDate, 'yyyy-mm-dd')), * FROM patients")
results3.show() 

Now I get:

+-------------------+---------+-------+-----------+-------------+
|                _c0|patientID|   name|dateOtBirth|lastVisitDate|
+-------------------+---------+-------+-----------+-------------+
|2012-01-20 00:01:00|     1001|Ah Teck| 1991-12-31|   2012-01-20|
|2012-01-20 00:09:00|     1002|  Kumar| 2011-10-29|   2012-09-20|
|2012-01-21 00:10:00|     1003|    Ali| 2011-01-30|   2012-10-21|
+-------------------+---------+-------+-----------+-------------+

If you look at the first column, you will see that all the months were somehow changed to 01.

What's wrong with the code?


Solution

  • The correct pattern for year-month-day is yyyy-MM-dd: in date patterns, lowercase mm means minutes while uppercase MM means month, which is why every parsed month fell back to 01 (see the sanity check after the first example below):

    val patDF = Seq(
      (1001, "Ah Teck", "1991-12-31", "2012-01-20"),
      (1002, "Kumar", "2011-10-29", "2012-09-20"),
      (1003, "Ali", "2011-01-30", "2012-10-21")
    ).toDF("patientID", "name", "dateOtBirth", "lastVisitDate")
    
    patDF.createOrReplaceTempView("patTable")
    
    val result1 = spark.sql("""
      select * from patTable where to_timestamp(lastVisitDate, 'yyyy-MM-dd')
        between '2012-09-15' and current_timestamp() order by lastVisitDate
    """)
    
    result1.show
    // +---------+-----+-----------+-------------+
    // |patientID| name|dateOtBirth|lastVisitDate|
    // +---------+-----+-----------+-------------+
    // |     1002|Kumar| 2011-10-29|   2012-09-20|
    // |     1003|  Ali| 2011-01-30|   2012-10-21|
    // +---------+-----+-----------+-------------+
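
    To see why the original query misbehaved: lowercase mm is the minute field, so 'yyyy-mm-dd' parses the 09 in 2012-09-20 as minutes and the month silently defaults to January. Here is a quick sanity check (a minimal sketch, assuming a SparkSession named spark and the legacy SimpleDateFormat-style parser of older Spark versions):

    spark.sql("""
      select from_unixtime(unix_timestamp('2012-09-20', 'yyyy-mm-dd')) as minutes_pattern,
             from_unixtime(unix_timestamp('2012-09-20', 'yyyy-MM-dd')) as month_pattern
    """).show(false)
    // expected output with the legacy parser:
    // +-------------------+-------------------+
    // |minutes_pattern    |month_pattern      |
    // +-------------------+-------------------+
    // |2012-01-20 00:09:00|2012-09-20 00:00:00|
    // +-------------------+-------------------+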
    

    You can also use the DataFrame API, if preferred:

    import org.apache.spark.sql.functions.{to_timestamp, lit, current_timestamp}
    import spark.implicits._  // enables the $"colName" column syntax

    val result2 = patDF.where(to_timestamp($"lastVisitDate", "yyyy-MM-dd")
        .between(to_timestamp(lit("2012-09-15"), "yyyy-MM-dd"), current_timestamp()))
      .orderBy($"lastVisitDate")