python, pandas, pyspark, pyspark-pandas

How to add multiples of 7 days to a date 13 times using PySpark


Please help me solve this issue, as I am still new to Python/PySpark. I want to loop 13 times, each time adding a multiple of 7 days to the date in the same column.

I have a master table like this :

id date
1 2019-02-21 10:00:00
2 2019-02-27 09:50:33

etc

I wrote a script, but it is wrong: it created a new column instead of new rows.

I want this kind of result :

id x date
1 7 2019-02-21
1 14 2019-02-28
1 21 2019-03-14
1 28 2019-03-21
1 35 2019-03-28
1 42 2019-04-04
1 49 2019-04-11
1 56 2019-04-18
1 63 2019-04-25
1 70 2019-05-02
1 77 2019-05-09
1 84 2019-05-16
1 91 2019-05-16

etc


Solution

  • Use Spark's built-in sequence function to generate the offsets 0, 7, ..., 91 as an array column, then explode that array into one row per offset and add the offset to the date with date_add.

    Example:

    from pyspark.sql.functions import col, expr, explode, to_date

    df = (spark.createDataFrame([('1', '2019-02-21 10:00:00'), ('2', '2019-02-27 09:50:33')],
                                ['id', 'date'])
          .withColumn("seq", expr("sequence(0, 91, 7)"))  # array of offsets: 0, 7, 14, ..., 91
          .withColumn("date", to_date(col("date"))))      # cast timestamp string to date
    df = df.select("id", "date", explode(col("seq")))     # one row per offset, in column "col"
    df.withColumn("date", expr("date_add(date, col)")).show(200, False)
    
    #+---+----------+---+
    #|id |date      |col|
    #+---+----------+---+
    #|1  |2019-02-21|0  |
    #|1  |2019-02-28|7  |
    #|1  |2019-03-07|14 |
    #|1  |2019-03-14|21 |
    #|1  |2019-03-21|28 |
    #|1  |2019-03-28|35 |
    #|1  |2019-04-04|42 |
    #|1  |2019-04-11|49 |
    #|1  |2019-04-18|56 |
    #|1  |2019-04-25|63 |
    #|1  |2019-05-02|70 |
    #|1  |2019-05-09|77 |
    #|1  |2019-05-16|84 |
    #|1  |2019-05-23|91 |
    #|2  |2019-02-27|0  |
    #|2  |2019-03-06|7  |
    #|2  |2019-03-13|14 |
    #|2  |2019-03-20|21 |
    #|2  |2019-03-27|28 |
    #|2  |2019-04-03|35 |
    #|2  |2019-04-10|42 |
    #|2  |2019-04-17|49 |
    #|2  |2019-04-24|56 |
    #|2  |2019-05-01|63 |
    #|2  |2019-05-08|70 |
    #|2  |2019-05-15|77 |
    #|2  |2019-05-22|84 |
    #|2  |2019-05-29|91 |
    #+---+----------+---+
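As a quick sanity check of the date arithmetic, the same 13 weekly offsets can be reproduced with plain Python datetime, without a Spark session (the start date 2019-02-21 is taken from the sample data above; to drop the offset-0 row, as in the desired result, start the range at 7):

```python
from datetime import date, timedelta

start = date(2019, 2, 21)                # first date from the sample data
offsets = list(range(7, 92, 7))          # 7, 14, ..., 91 -- exactly 13 offsets
rows = [(x, start + timedelta(days=x)) for x in offsets]
for x, d in rows:
    print(x, d)
```

The last pair is (91, 2019-05-23), which agrees with the final row of the Spark output for id 1.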