apache-spark, pyspark, apache-spark-sql, sparkr

windowPartitionBy and repartition in pyspark


I have a small piece of code in SparkR that I would like to translate to PySpark. I am not familiar with windowPartitionBy or repartition. Could you please help me understand what this code is doing?

ws <- orderBy(windowPartitionBy('A'), 'B')
df1 <- df %>% select(df$A, df$B, df$D, SparkR::over(lead(df$C, 1), ws))
df2 <- repartition(df1, col = df1$D)

Solution

  • In PySpark, the equivalent would be:

    from pyspark.sql import functions as F, Window

    # window spec: rows grouped by A, ordered by B within each group
    ws = Window.partitionBy('A').orderBy('B')
    # for each row, also select column C taken from the next row in its window
    df1 = df.select('A', 'B', 'D', F.lead('C', 1).over(ws))
    # redistribute df1's rows across partitions by hashing column D
    df2 = df1.repartition('D')
    

    The code selects columns A, B and D from df, plus, for each row, the value of column C from the next row within the window ws (rows sharing the same A, ordered by B), into df1. For the last row of each window partition, lead returns null, since there is no next row.
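
    For example, here is a minimal, self-contained sketch with made-up data (the column values and the next_C alias are illustrative, not from the question) that shows what lead does at partition boundaries:

    from pyspark.sql import SparkSession, functions as F, Window

    spark = SparkSession.builder.getOrCreate()
    # toy data: two rows for A='x', one row for A='y'
    data = [('x', 1, 10, 'p'), ('x', 2, 20, 'q'), ('y', 1, 30, 'p')]
    df = spark.createDataFrame(data, ['A', 'B', 'C', 'D'])

    ws = Window.partitionBy('A').orderBy('B')
    df.select('A', 'B', 'D', F.lead('C', 1).over(ws).alias('next_C')).show()
    # Within A='x': the B=1 row gets next_C=20 and the B=2 row gets null
    # (no next row in its partition); the single A='y' row also gets null.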

    Then it repartitions df1 by column D into df2. Partitioning determines how a DataFrame's rows are distributed across memory/storage, and it directly affects how they are processed in parallel. If you want to know more about repartitioning DataFrames, see https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.repartition
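
    For example, you can check the partition count before and after (the exact numbers depend on your data source and Spark configuration, so the values in the comments are only illustrative):

    df2 = df1.repartition('D')          # hash-partition rows by column D
    print(df1.rdd.getNumPartitions())   # e.g. 8, depends on how df1 was created
    print(df2.rdd.getNumPartitions())   # defaults to spark.sql.shuffle.partitions (200)

    After the repartition, all rows sharing the same value of D end up in the same partition, which is useful before per-D aggregations or writes.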