I need to make a copy of a column in PySpark and pad it with one value at the top so that the column is shifted down by 1. Here is an example:
old df
+---+--------+
| ID|   count|
+---+--------+
| 11| 1666030|
| 01|   21924|
| 99| 1487407|
| 09|43905646|
| 02|   21420|
+---+--------+
desired result
+---+--------+--------+
| ID|   count|paddedID|
+---+--------+--------+
| 11| 1666030|      00|
| 01|   21924|      11|
| 99| 1487407|      01|
| 09|43905646|      99|
| 02|   21420|      09|
+---+--------+--------+
This can be accomplished fairly easily with the np.pad() function, but I'm not sure how to do it over a large data set in PySpark. Thanks in advance!
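For context, here is a minimal sketch of that pandas/NumPy version; pandasdf just holds the toy data from the example above:

import numpy as np
import pandas as pd

# toy frame mirroring the example above
pandasdf = pd.DataFrame({"ID": ["11", "01", "99", "09", "02"],
                         "count": [1666030, 21924, 1487407, 43905646, 21420]})

# pad one "00" at the top, then drop the last value to shift the column down by 1
pandasdf["paddedID"] = np.pad(pandasdf["ID"].to_numpy(), (1, 0), constant_values="00")[:-1]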
EDIT: There is a sortable time column (not shown in this toy example) which we will use to make sure the shift happens the same way every time. Thanks for pointing this out, everyone.
Use the lag() window function to get the desired result.
from pyspark.sql.window import Window
from pyspark.sql.functions import lit, lag
sdf = spark.createDataFrame(pandasdf)
sdf.show()
+---+--------+
| ID| count|
+---+--------+
| 11| 1666030|
| 01| 21924|
| 99| 1487407|
| 09|43905646|
| 02| 21420|
+---+--------+
Create a new column paddedID holding the previous value of ID in each record, using the lag() window function. The third argument to lag() ("00") is the default value used for the first row, which has no previous record.

window_spec = Window.orderBy(lit(0))
sdf.withColumn("paddedID", lag("ID", 1, "00").over(window_spec)).show()
Output
+---+--------+--------+
| ID| count|paddedID|
+---+--------+--------+
| 11| 1666030| 00|
| 01| 21924| 11|
| 99| 1487407| 01|
| 09|43905646| 99|
| 02| 21420| 09|
+---+--------+--------+
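Note that Window.orderBy(lit(0)) does not guarantee a stable row order on a distributed dataset (Spark will also warn that a window without a partition moves all data to a single partition). Since your real data has a sortable time column, order the window by it instead; event_time below is a hypothetical name standing in for that column:

from pyspark.sql.window import Window
from pyspark.sql.functions import lag

# order by the (hypothetical) time column so the shift is reproducible on every run
window_spec = Window.orderBy("event_time")
sdf = sdf.withColumn("paddedID", lag("ID", 1, "00").over(window_spec))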