Overriding values in a column to the next value on list


I'm trying to preprocess a column in a Spark dataframe. The column contains int values, for example [41, 43, 45, 59, 72]. I want to round each value to the nearest multiple of 5, for example 41 -> 40, 43 -> 45, 45 -> 45, 59 -> 60, ...

How can I do this most efficiently on a PySpark dataframe?


Solution

  •  Divide by 5, round to the nearest integer, then multiply back: F.round(F.col('c1') / 5) * 5
    
    from pyspark.sql import functions as F
    df = spark.createDataFrame([(41,), (43,), (45,), (59,), (72,)], ['c1'])
    
    df = df.withColumn('c2', (F.round(F.col('c1') / 5) * 5).cast('int'))
    
    df.show()
    # +---+---+
    # | c1| c2|
    # +---+---+
    # | 41| 40|
    # | 43| 45|
    # | 45| 45|
    # | 59| 60|
    # | 72| 70|
    # +---+---+
    

    To override, instead of a new name, use the existing column name:

    from pyspark.sql import functions as F
    df = spark.createDataFrame([(41,), (43,), (45,), (59,), (72,)], ['c1'])
    
    df = df.withColumn('c1', (F.round(F.col('c1') / 5) * 5).cast('int'))
    
    df.show()
    # +---+
    # | c1|
    # +---+
    # | 40|
    # | 45|
    # | 45|
    # | 60|
    # | 70|
    # +---+
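    The same round-divide-multiply arithmetic generalizes to any step size. Below is a sketch of that idea in plain Python (`round_to_step` is a hypothetical helper name, not part of the answer above). One caveat: Python's built-in `round` uses banker's rounding at exact halves (e.g. 42.5 with step 5 gives 40), while Spark's `round` uses HALF_UP (which would give 45), so results can differ at the midpoint.

    ```python
    def round_to_step(value, step):
        """Round value to the nearest multiple of step (hypothetical helper)."""
        return int(round(value / step)) * step

    # The PySpark equivalent, with the step factored out into a variable:
    #   step = 5
    #   df = df.withColumn('c1', (F.round(F.col('c1') / step) * step).cast('int'))

    print([round_to_step(v, 5) for v in [41, 43, 45, 59, 72]])
    # -> [40, 45, 45, 60, 70]
    ```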