Tags: numpy, apache-spark, pyspark, large-data

How do I pad a pyspark column array like numpy?


I need to make a copy of a column in PySpark and pad it with one value at the top to shift the column down by 1. Here is an example:
old df

+----------+--------+
|        ID|   count|
+----------+--------+
|        11| 1666030|
|        01|   21924|
|        99| 1487407|
|        09|43905646|
|        02|   21420|
+----------+--------+

desired result

+----------+--------+------------+
|        ID|   count|    paddedID|
+----------+--------+------------+
|        11| 1666030|          00|
|        01|   21924|          11|
|        99| 1487407|          01|
|        09|43905646|          99|
|        02|   21420|          09|
+----------+--------+------------+

This can be accomplished fairly easily with the np.pad() function, but I'm not sure how to do it over a large data set in PySpark. Thanks in advance!
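For reference, a minimal sketch of the NumPy version (values taken from the example above): pad one "00" onto the front of the array, then drop the last element so the length is unchanged.

import numpy as np

ids = np.array(["11", "01", "99", "09", "02"])
# Pad one "00" at the front, then trim the last element to keep the length.
padded = np.pad(ids, (1, 0), constant_values="00")[:-1]
print(padded)  # ['00' '11' '01' '99' '09']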

EDIT: There is a sortable time column (not shown in this toy example) which we will use to make sure the shift happens the same way every time. Thanks for pointing this out, everyone.


Solution

  • Use the lag() window function to get the desired result.

    1. Import the necessary packages:
    from pyspark.sql.window import Window
    from pyspark.sql.functions import lit, lag
    
    2. Convert your pandas DataFrame to a Spark DataFrame (a sketch of building pandasdf follows after the output below):
    sdf = spark.createDataFrame(pandasdf)
    sdf.show()
    
    +---+--------+
    | ID|   count|
    +---+--------+
    | 11| 1666030|
    | 01|   21924|
    | 99| 1487407|
    | 09|43905646|
    | 02|   21420|
    +---+--------+
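    The pandasdf above can be built like this; a minimal sketch using the question's example values (the IDs are strings so leading zeros like "01" survive):
    
    import pandas as pd
    
    # IDs as strings so leading zeros ("01", "09") are preserved.
    pandasdf = pd.DataFrame({
        "ID": ["11", "01", "99", "09", "02"],
        "count": [1666030, 21924, 1487407, 43905646, 21420],
    })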
    
    3. Add a new column paddedID containing the previous value of each record, using the lag() window function:
    # Ordering by a constant is arbitrary: with no partitionBy, Spark moves
    # all rows into a single partition and the row order is nondeterministic.
    window_spec = Window.orderBy(lit(0))
    
    # lag("ID", 1, "00") returns the previous row's ID, or "00" for the first row.
    sdf.withColumn("paddedID", lag("ID", 1, "00").over(window_spec)).show()
    

    Output

    +---+--------+--------+
    | ID|   count|paddedID|
    +---+--------+--------+
    | 11| 1666030|      00|
    | 01|   21924|      11|
    | 99| 1487407|      01|
    | 09|43905646|      99|
    | 02|   21420|      09|
    +---+--------+--------+
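
    Since the question's edit mentions a sortable time column, in practice you would order the window by that column instead of lit(0) so the shift is deterministic. A minimal sketch, assuming that column is named event_time (a hypothetical name; substitute the real one):
    
    # "event_time" is a hypothetical stand-in for the real sortable time column.
    window_spec = Window.orderBy("event_time")
    sdf.withColumn("paddedID", lag("ID", 1, "00").over(window_spec)).show()
    
    Note that without a partitionBy clause Spark still pulls all rows into one partition for the window, which can be slow on a large data set; if the data has a natural grouping key, adding partitionBy on it avoids that.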