PySpark: generate rows depending on a column value


Below is the input data:

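+-------------------+-------------------+--------+
|       start       |    format_date    |  diff  |
+-------------------+-------------------+--------+
|2019-11-15 20:30:00|2019-11-15 18:30:00|    4   |
+-------------------+-------------------+--------+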

Expected output:

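+-------------------+-------------------+--------+-----+
|       start       |    format_date    |  diff  | seq |
+-------------------+-------------------+--------+-----+
|2019-11-15 20:30:00|2019-11-15 18:30:00|    4   |  1  |
|2019-11-15 20:30:00|2019-11-15 18:30:00|    4   |  2  |
|2019-11-15 20:30:00|2019-11-15 18:30:00|    4   |  3  |
|2019-11-15 20:30:00|2019-11-15 18:30:00|    4   |  4  |
+-------------------+-------------------+--------+-----+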

How do I generate rows depending on the value of a column (diff)?


Solution

  • Spark < 2.4

    You can build an array column with a UDF and then use the explode function to turn each array element into its own row:

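    import pyspark.sql.functions as F
    import pyspark.sql.types as Types
    
    # Build the list [1, 2, ..., diff] for each row
    def rangeArr(diff):
      # range() returns a lazy range object in Python 3, so convert it to a list
      return list(range(1, diff + 1))
    rangeUdf = F.udf(rangeArr, Types.ArrayType(Types.IntegerType()))
    
    # Add an array column holding the sequence
    df = df.withColumn('seqArr', rangeUdf('diff'))
    
    # Emit one output row per array element
    df = df.withColumn('seq', F.explode('seqArr'))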
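
To see what the two steps do to a single row, here is a minimal plain-Python sketch that runs without a Spark session. It only models the logic and is not Spark code: rows are represented as dicts, and `range_arr` and `explode_by_diff` are hypothetical helper names chosen for this illustration. Treating a null `diff` as an empty sequence is also an assumption of the sketch.

```python
# Plain-Python model of the UDF + explode steps (no Spark required).
# range_arr and explode_by_diff are illustrative names, not PySpark APIs.

def range_arr(diff):
    # Mirrors the UDF: build [1, 2, ..., diff].
    # Assumption: a missing (None) diff yields an empty sequence.
    if diff is None:
        return []
    return list(range(1, diff + 1))

def explode_by_diff(rows):
    # Mirrors explode: emit one copy of the row per sequence element.
    # Rows whose sequence is empty produce no output rows.
    out = []
    for row in rows:
        for seq in range_arr(row["diff"]):
            out.append({**row, "seq": seq})
    return out

rows = [{
    "start": "2019-11-15 20:30:00",
    "format_date": "2019-11-15 18:30:00",
    "diff": 4,
}]

result = explode_by_diff(rows)
for r in result:
    print(r["start"], r["format_date"], r["diff"], r["seq"])
```

As in this sketch, `F.explode` produces no output row when the array is empty or null, so rows with `diff` of 0 disappear from the result; `F.explode_outer` keeps them with a null `seq` instead.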