I have a dataframe with content like below:
scala> patDF.show
+---------+-------+-----------+-------------+
|patientID| name|dateOtBirth|lastVisitDate|
+---------+-------+-----------+-------------+
| 1001|Ah Teck| 1991-12-31| 2012-01-20|
| 1002| Kumar| 2011-10-29| 2012-09-20|
| 1003| Ali| 2011-01-30| 2012-10-21|
+---------+-------+-----------+-------------+
all the columns are string
I want to get the list of records with lastVisitDate falling in the range of format of yyyy-mm-dd and now, so here is the script:
patDF.registerTempTable("patients")
val results2 = sqlContext.sql("SELECT * FROM patients WHERE from_unixtime(unix_timestamp(lastVisitDate, 'yyyy-mm-dd')) between '2012-09-15' and current_timestamp() order by lastVisitDate")
results2.show()
It gets me nothing, presumably, there should be records with patientID of 1002 and 1003.
So I modified the query to:
val results3 = sqlContext.sql("SELECT from_unixtime(unix_timestamp(lastVisitDate, 'yyyy-mm-dd')), * FROM patients")
results3.show()
Now I get:
+-------------------+---------+-------+-----------+-------------+
| _c0|patientlD| name|dateOtBirth|lastVisitDate|
+-------------------+---------+-------+-----------+-------------+
|2012-01-20 00:01:00| 1001|Ah Teck| 1991-12-31| 2012-01-20|
|2012-01-20 00:09:00| 1002| Kumar| 2011-10-29| 2012-09-20|
|2012-01-21 00:10:00| 1003| Ali| 2011-01-30| 2012-10-21|
+-------------------+---------+-------+-----------+-------------+
If you look at the first column, you will see all the months were somehow changed to 01
What's wrong with the code?
The correct format for year-month-day
should be yyyy-MM-dd
:
val patDF = Seq(
(1001, "Ah Teck", "1991-12-31", "2012-01-20"),
(1002, "Kumar", "2011-10-29", "2012-09-20"),
(1003, "Ali", "2011-01-30", "2012-10-21")
)toDF("patientID", "name", "dateOtBirth", "lastVisitDate")
patDF.createOrReplaceTempView("patTable")
val result1 = spark.sqlContext.sql("""
select * from patTable where to_timestamp(lastVisitDate, 'yyyy-MM-dd')
between '2012-09-15' and current_timestamp() order by lastVisitDate
""")
result1.show
// +---------+-----+-----------+-------------+
// |patientID| name|dateOtBirth|lastVisitDate|
// +---------+-----+-----------+-------------+
// | 1002|Kumar| 2011-10-29| 2012-09-20|
// | 1003| Ali| 2011-01-30| 2012-10-21|
// +---------+-----+-----------+-------------+
You can also use DataFrame API, if wanted:
val result2 = patDF.where(to_timestamp($"lastVisitDate", "yyyy-MM-dd").
between(to_timestamp(lit("2012-09-15"), "yyyy-MM-dd"), current_timestamp())
).orderBy($"lastVisitDate")