Search code examples
javascalaapache-sparkexplodedate-range

In Scala, how do I create a column of date arrays of monthly dates between a start and end date?


In Spark Scala, I am trying to create a column that contains an array of monthly dates between a start and an end date (inclusive).

For example, if we have 2018-02-07 and 2018-04-28, the array should contain [2018-02-01, 2018-03-01, 2018-04-01].

Besides the monthly version I would also like to create a quarterly version, i.e. [2018-1, 2018-2].

Example Input Data:

id startDate endDate
1_1 2018-02-07 2018-04-28
1_2 2018-05-06 2018-05-31
2_1 2017-04-13 2017-04-14

Expected (monthly) Output 1:

id startDate endDate dateRange
1_1 2018-02-07 2018-04-28 [2018-02-01, 2018-03-01, 2018-04-01]
1_1 2018-05-06 2018-05-31 [2018-05-01]
2_1 2017-04-13 2017-04-14 [2017-04-01]

Ultimate expected (monthly) output 2:

id Date
1_1 2018-02-01 
1_1 2018-03-01
1_1 2018-04-01
1_2 2018-05-01
2_1 2017-04-01

I have spark 2.1.0.167, Scala 2.10.6, and JavaHotSpot 1.8.0_172.

I have tried to implement several answers to similar (day-level) questions on here, but I am struggling with getting a monthly/quarterly version to work.

The below creates an array from start and endDate and explodes it. However I need to explode a column that contains all the monthly (quarterly) dates in-between.

val df1 = df.select($"id", $"startDate", $"endDate").
// This just creates an array of start and end Date
withColumn("start_end_array"), array($"startDate", $"endDate").
withColumn("start_end_array"), explode($"start_end_array"))

Thank you for any leads.


Solution

  • case class MyData(id: String, startDate: String, endDate: String, list: List[String])
    val inputData = Seq(("1_1", "2018-02-07", "2018-04-28"), ("1_2", "2018-05-06", "2018-05-31"), ("2_2", "2017-04-13", "2017-04-14"))
    inputData.map(x => {
      import java.time.temporal._
      import java.time._
      val startDate = LocalDate.parse(x._2)
      val endDate = LocalDate.parse(x._3)
      val diff = ChronoUnit.MONTHS.between(startDate, endDate)
      var result = List[String]();
      for (index <- 0 to diff.toInt) {
        result = (startDate.getYear + "-" + (startDate.getMonth.getValue + index) + "-01") :: result
      }
      new MyData(x._1, x._2, x._3, result)
    }).foreach(println)