Below is the input data:
+-------------------+-------------------+----+
|              start|        format_date|diff|
+-------------------+-------------------+----+
|2019-11-15 20:30:00|2019-11-15 18:30:00|   4|
+-------------------+-------------------+----+
Expected output:
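+-------------------+-------------------+----+---+
|              start|        format_date|diff|seq|
+-------------------+-------------------+----+---+
|2019-11-15 20:30:00|2019-11-15 18:30:00|   4|  1|
|2019-11-15 20:30:00|2019-11-15 18:30:00|   4|  2|
|2019-11-15 20:30:00|2019-11-15 18:30:00|   4|  3|
|2019-11-15 20:30:00|2019-11-15 18:30:00|   4|  4|
+-------------------+-------------------+----+---+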
How do I generate rows depending on the value of a column (diff)?
Spark < 2.4
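You can build an array column [1, ..., diff] with a UDF and then use the explode function, which emits one row per array element: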
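import pyspark.sql.functions as F
import pyspark.sql.types as Types

def rangeArr(diff):
    # Return a real list; in Python 3, range() gives a lazy range object, not a list
    return list(range(1, diff + 1))

# UDF that produces the array [1, 2, ..., diff] for each row
rangeUdf = F.udf(rangeArr, Types.ArrayType(Types.IntegerType()))
df = df.withColumn('seqArr', rangeUdf('diff'))
# explode emits one output row per array element; drop the helper column afterwards
df = df.withColumn('seq', F.explode('seqArr')).drop('seqArr')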
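
To make the row expansion easier to reason about, here is a minimal plain-Python sketch of what the UDF plus explode combination does to each row. It does not use Spark at all. The helper name `expand_rows` and its parameters are hypothetical, and the guard for missing or non-positive counts is my own assumption (the UDF as written would raise an error on a null diff, while explode itself produces no rows for null or empty arrays).

```python
def expand_rows(rows, count_key="diff", seq_key="seq"):
    """Repeat each row `row[count_key]` times, numbering the copies from 1."""
    out = []
    for row in rows:
        n = row.get(count_key)
        # Assumption: a missing or non-positive count yields no output rows,
        # similar to how explode produces nothing for null or empty arrays.
        if n is None or n < 1:
            continue
        for seq in range(1, n + 1):
            # Copy the original row and attach the running sequence number
            out.append({**row, seq_key: seq})
    return out


rows = [
    {"start": "2019-11-15 20:30:00", "format_date": "2019-11-15 18:30:00", "diff": 4},
]
expanded = expand_rows(rows)
for r in expanded:
    print(r["start"], r["format_date"], r["diff"], r["seq"])
```

With the sample row from the question this yields four rows with seq running from 1 to 4, which is the shape of the expected output.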