
How do I add a label with CASE WHEN using a time range in Spark SQL?


I have a table that shows an ID and a timestamp. I want to add a label for each timestamp range.

ID          timestamp
a       2020-01-16 08:55:50
b       2020-01-16 08:57:37
c       2020-01-16 09:00:13
d       2020-01-16 09:01:32
e       2020-01-16 09:03:32
f       2020-01-16 09:06:56

For example, everything from 2020-01-16 08:55:50 to 2020-01-16 09:00:13 is X, and everything from 2020-01-16 09:01:32 to 2020-01-16 09:06:56 is Y.

I expect the table will show:

ID        timestamp                type_flag
a       2020-01-16 08:55:50          X
b       2020-01-16 08:57:37          X
c       2020-01-16 09:00:13          X
d       2020-01-16 09:01:32          Y
e       2020-01-16 09:03:32          Y
f       2020-01-16 09:06:56          Y
g       2020-01-16 09:08:51          Z
h       2020-01-16 09:10:43          Z
i       2020-01-16 09:13:21          Z

So far, this is what I have tried:

SELECT *,
    CASE WHEN timestamp BETWEEN '2020-01-16 08:55:50' AND '2020-01-16 09:00:13' THEN 'X' 
         WHEN timestamp BETWEEN '2020-01-16 09:01:32' and '2020-01-16 09:06:56' THEN 'Y'
         WHEN timestamp BETWEEN '2020-01-16 09:08:51' and '2020-01-16 09:13:21' THEN 'Z'
    ELSE 'A' END AS type_flag
FROM table1;

but it gave me an error saying:

Error [22P02]: ERROR: invalid input syntax for integer: "2021-01-16 08:55:50"
  Position: 37

How should I fix my query to get the desired result? I am using Spark SQL for this.

Thanks.


Solution

  • The CASE WHEN syntax itself is valid Spark SQL, so I think something is wrong with how you are running the query or doing your transformation. (Note that 22P02 is a PostgreSQL error code, which suggests the query may not have been run against Spark at all.) Running the same transformation in Spark works:

    // Creating sample data
    import spark.implicits._
    val df = Seq(
      ("a", "2020-01-16 08:55:50"),
      ("b", "2020-01-16 08:57:37"),
      ("c", "2020-01-16 09:00:13"),
      ("d", "2020-01-16 09:01:32"),
      ("e", "2020-01-16 09:03:32"),
      ("f", "2020-01-16 09:06:56")
    ).toDF("ID", "timestamp")
    // Casting the timestamp column from string to timestamp
    val df1 = df.withColumn("timestamp", $"timestamp".cast("timestamp"))
    // Creating a view so that it can be queried with Spark SQL
    df1.createOrReplaceTempView("timestamptest")
    // CASE WHEN statements inside the Spark SQL query
    val df3 = spark.sql("""select *, CASE WHEN timestamp BETWEEN '2020-01-16 08:55:50' AND '2020-01-16 09:00:13' THEN 'X' 
             WHEN timestamp BETWEEN '2020-01-16 09:01:32' and '2020-01-16 09:06:56' THEN 'Y'
             WHEN timestamp BETWEEN '2020-01-16 09:08:51' and '2020-01-16 09:13:21' THEN 'Z'
        ELSE 'A' END As type_flag from timestamptest""")
    // display is Databricks-specific; use df3.show(false) elsewhere
    display(df3)
    

    You can see the output below:

    ID        timestamp                type_flag
    a       2020-01-16 08:55:50          X
    b       2020-01-16 08:57:37          X
    c       2020-01-16 09:00:13          X
    d       2020-01-16 09:01:32          Y
    e       2020-01-16 09:03:32          Y
    f       2020-01-16 09:06:56          Y
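
    The same labelling can also be expressed with the DataFrame API instead of a temp view, using chained `when`/`otherwise` calls. This is a sketch assuming the `df1` DataFrame (with the `timestamp` column already cast) and a live SparkSession as above:

    import org.apache.spark.sql.functions.{when, col}

    // Equivalent of the SQL CASE WHEN, expressed with the DataFrame API
    val labelled = df1.withColumn(
      "type_flag",
      when(col("timestamp").between("2020-01-16 08:55:50", "2020-01-16 09:00:13"), "X")
        .when(col("timestamp").between("2020-01-16 09:01:32", "2020-01-16 09:06:56"), "Y")
        .when(col("timestamp").between("2020-01-16 09:08:51", "2020-01-16 09:13:21"), "Z")
        .otherwise("A")
    )
    labelled.show(false)

    Both forms compile to the same plan, so this is purely a style choice; the DataFrame version just avoids registering a temp view.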