java apache-spark group-by apache-spark-dataset

Group spark Dataset by month from a timestamp in java

I've loaded all rows from a table into Dataset using spark session in java. I want to get the count of rows in each month.

I tried to create new column of month by using withColumn() so that I can later use group_by month and count(). But I am not able to get month from timestamp. How can I find the count in each month from above dataset?

My sample Dataset will look like this,

Solution

Considering the way you have explained your problem: I tried to create new column of month by using withColumn() so that I can later use groupBy() month and count(). But I am not able to get month from timestamp.

You can you the static month()function provided in org.apache.spark.sql.functions package to find the month, as below:

myDataset.withColumn("month", month(col("date"))).groupBy(col("month")).count().show()

where col("date") will have the timestamp (in below case : "yyyy-mm-dd HH:mm:ss" ).

Input used:

1,2019-04-07  07:24:14,0,8

2,2019-05-07  07:24:14,0,10

5,2019-06-07  07:24:14,0,6

3,2019-04-07  07:24:14,0,7

This will give you an output as below:

+-----+-----+
|month|count|
+-----+-----+
|    6|    1|
|    5|    1|
|    4|    2|
+-----+-----+

Hope this helps!!