I have this table where shows ID and also timestamp. I want to add label for each timestamp range.
ID timestamp
a 2020-01-16 08:55:50
b 2020-01-16 08:57:37
c 2020-01-16 09:00:13
d 2020-01-16 09:01:32
e 2020-01-16 09:03:32
f 2020-01-16 09:06:56
for example, from 2020-01-16 08:55:50 to 2020-01-16 09:00:13 is X, and from 2020-01-16 09:01:32 to 2020-01-16 09:06:56 is Y.
I expect the table will show:
ID timestamp type_flag
a 2020-01-16 08:55:50 X
b 2020-01-16 08:57:37 X
c 2020-01-16 09:00:13 X
d 2020-01-16 09:01:32 Y
e 2020-01-16 09:03:32 Y
f 2020-01-16 09:06:56 Y
g 2020-01-16 09:08:51 Z
h 2020-01-16 09:10:43 Z
i 2020-01-16 09:13:21 Z
so far, what I have tried:
SELECT *,
CASE WHEN timestamp BETWEEN '2020-01-16 08:55:50' AND '2020-01-16 09:00:13' THEN 'X'
WHEN timestamp BETWEEN '2020-01-16 09:01:32' and '2020-01-16 09:06:56' THEN 'Y'
WHEN timestamp BETWEEN '2020-01-16 09:08:51' and '2020-01-16 09:13:21' THEN 'Z'
ELSE 'A' END AS type_flag
FROM table1;
but it gave me an error saying:
Error [22P02]: ERROR: invalid input syntax for integer: "2021-01-16 08:55:50"
Position: 37
How should I fix my query to get my desired result? I use spark sql for this
Thanks.
I think there is something wrong with your syntax or how you are doing your transformation.
//creating sample data
val df = Seq(("a","2020-01-16 08:55:50"),("b","2020-01-16 08:57:37"),("c","2020-01-16 09:00:13"),("d","2020-01-16 09:01:32"),("e","2020-01-16 09:03:32"),("f","2020-01-16 09:06:56")).toDF("ID","timestamp")
//changing the data type of the timestamp column from string to timestamp
import org.apache.spark.sql.types._
val df1 = df.withColumn("timestamp",$"timestamp".cast("TimeStamp"))
//creating a view so that I can query it using spark sql
df1.createOrReplaceTempView("timestamptest")
//case when statements inside the spark sql
val df3 = spark.sql("""select *, CASE WHEN timestamp BETWEEN '2020-01-16 08:55:50' AND '2020-01-16 09:00:13' THEN 'X'
WHEN timestamp BETWEEN '2020-01-16 09:01:32' and '2020-01-16 09:06:56' THEN 'Y'
WHEN timestamp BETWEEN '2020-01-16 09:08:51' and '2020-01-16 09:13:21' THEN 'Z'
ELSE 'A' END As type_flag from timestamptest""")
display(df3)
You can see the output as below :