Search code examples
javaapache-sparkgroup-byapache-spark-dataset

Group spark Dataset by month from a timestamp in java


I've loaded all rows from a table into Dataset using spark session in java. I want to get the count of rows in each month.

I tried to create new column of month by using withColumn() so that I can later use group_by month and count(). But I am not able to get month from timestamp. How can I find the count in each month from above dataset?

My sample Dataset will look like this,

enter image description here


Solution

  • Considering the way you have explained your problem: I tried to create new column of month by using withColumn() so that I can later use groupBy() month and count(). But I am not able to get month from timestamp.

    You can you the static month()function provided in org.apache.spark.sql.functions package to find the month, as below:

    myDataset.withColumn("month", month(col("date"))).groupBy(col("month")).count().show()
    

    where col("date") will have the timestamp (in below case : "yyyy-mm-dd HH:mm:ss" ).

    Input used:

    1,2019-04-07  07:24:14,0,8
    
    2,2019-05-07  07:24:14,0,10
    
    5,2019-06-07  07:24:14,0,6
    
    3,2019-04-07  07:24:14,0,7
    

    This will give you an output as below:

    +-----+-----+
    |month|count|
    +-----+-----+
    |    6|    1|
    |    5|    1|
    |    4|    2|
    +-----+-----+
    

    Hope this helps!!