Please help me solve this issue, as I am still new to Python/PySpark. I want to loop over a date, adding multiples of 7 days, 13 times, producing one row per step in the same column.
I have a master table like this:
id | date |
---|---|
1 | 2019-02-21 10:00:00 |
2 | 2019-02-27 09:50:33 |
etc
I have written a script, but it is wrong: it adds the results as new columns, whereas I want them as new rows.
I want this kind of result :
id | x | date |
---|---|---|
1 | 7 | 2019-02-21 |
1 | 14 | 2019-02-28 |
1 | 21 | 2019-03-07 |
1 | 28 | 2019-03-14 |
1 | 35 | 2019-03-21 |
1 | 42 | 2019-03-28 |
1 | 49 | 2019-04-04 |
1 | 56 | 2019-04-11 |
1 | 63 | 2019-04-18 |
1 | 70 | 2019-04-25 |
1 | 77 | 2019-05-02 |
1 | 84 | 2019-05-09 |
1 | 91 | 2019-05-16 |
etc
Use the `sequence` function and then `explode`.
Example:
from pyspark.sql.functions import col, explode, expr, to_date

df = spark.createDataFrame(
    [('1', '2019-02-21 10:00:00'), ('2', '2019-02-27 09:50:33')],
    ['id', 'date'],
)

df = (
    df
    .withColumn("seq", expr("sequence(0, 91, 7)"))    # array [0, 7, 14, ..., 91]
    .withColumn("date", to_date(col("date")))         # drop the time part
    .select("id", "date", explode(col("seq")))        # one row per array element, in column "col"
    .withColumn("date", expr("date_add(date, col)"))  # shift the date by the offset
)
df.show(200, False)
#+---+----------+---+
#|id |date |col|
#+---+----------+---+
#|1 |2019-02-21|0 |
#|1 |2019-02-28|7 |
#|1 |2019-03-07|14 |
#|1 |2019-03-14|21 |
#|1 |2019-03-21|28 |
#|1 |2019-03-28|35 |
#|1 |2019-04-04|42 |
#|1 |2019-04-11|49 |
#|1 |2019-04-18|56 |
#|1 |2019-04-25|63 |
#|1 |2019-05-02|70 |
#|1 |2019-05-09|77 |
#|1 |2019-05-16|84 |
#|1 |2019-05-23|91 |
#|2 |2019-02-27|0 |
#|2 |2019-03-06|7 |
#|2 |2019-03-13|14 |
#|2 |2019-03-20|21 |
#|2 |2019-03-27|28 |
#|2 |2019-04-03|35 |
#|2 |2019-04-10|42 |
#|2 |2019-04-17|49 |
#|2 |2019-04-24|56 |
#|2 |2019-05-01|63 |
#|2 |2019-05-08|70 |
#|2 |2019-05-15|77 |
#|2 |2019-05-22|84 |
#|2 |2019-05-29|91 |
#+---+----------+---+
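To sanity-check the arithmetic that `sequence(0, 91, 7)` + `date_add` performs, here is a minimal sketch in plain Python (no Spark needed; `weekly_dates` is a hypothetical helper name, not a Spark API):

```python
from datetime import date, timedelta

def weekly_dates(start, steps=14, step_days=7):
    """Yield (offset_in_days, date) pairs: start, start+7d, ..., start+91d."""
    for i in range(steps):
        offset = i * step_days
        yield offset, start + timedelta(days=offset)

rows = list(weekly_dates(date(2019, 2, 21)))
print(rows[0])   # (0, datetime.date(2019, 2, 21))
print(rows[-1])  # (91, datetime.date(2019, 5, 23))
```

The first and last pairs match the Spark output above for id 1, which confirms the offsets 0 through 91 in steps of 7 land on the expected dates.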