Tags: python, pyspark, window

Generate repeating row number based on partition column in pyspark


I want to generate the quarterly column shown below in pyspark: within each l_id, its value should change after every 4 records. Before generating the quarterly column, the data will be ordered by the l_id and week columns.

[Image: sample data showing the l_id and week columns together with the desired quarterly column]
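Since the image is not reproduced here, below is a minimal sketch of the kind of input and expected output being described. The column names come from the question, but the concrete values, and the "month<N>" format of the week strings (inferred from the solution below), are illustrative assumptions only.

    # Illustrative data only: the real values are in the image above and are
    # not reproduced in the post. The "month<N>" week format is an assumption.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    rows = [("A", f"month{i}") for i in range(1, 9)] + \
           [("B", f"month{i}") for i in range(1, 5)]
    df = spark.createDataFrame(rows, ["l_id", "week"])

    # Desired quarterly column after ordering by l_id and week:
    #   l_id A, month1..month4 -> 1
    #   l_id A, month5..month8 -> 2
    #   l_id B, month1..month4 -> 1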


Solution

  • My bad, I was thinking there was already a quaterly column present in your dataframe, but it seems you need a column that looks like quaterly. I don't think that is possible with a Window function alone, but here's a way to achieve it:

    Assuming your current data is in df.

    from pyspark.sql.functions import split, lit

    # getItem(1) picks out the number that follows the literal text "month".
    split_col = split(df["week"], 'month')
    df = df.withColumn(
        "quaterly",
        (split_col.getItem(1).cast("integer") / (df["sequence_change"] + lit(1))).cast("integer") + lit(1),
    ).orderBy("l_id", "week")
    

    Logic explanation: we take the month number out of the week column values, cast it from a string to an integer, divide it by sequence_change + 1, and cast the result back to an integer so there are no decimals. Finally we add 1 so that the quaterly column starts at 1 instead of 0.
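    To make the arithmetic concrete, here is a hedged, self-contained sketch of the same expression applied to made-up data. The "month<N>" week format and the constant sequence_change = 2 are assumptions; the original post does not show the actual values.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, lit

    spark = SparkSession.builder.getOrCreate()

    # Made-up rows; sequence_change = 2 is an assumption, not from the post.
    df = spark.createDataFrame(
        [("A", "month1", 2), ("A", "month2", 2), ("A", "month3", 2),
         ("A", "month4", 2), ("B", "month1", 2), ("B", "month2", 2)],
        ["l_id", "week", "sequence_change"],
    )

    split_col = split(df["week"], 'month')   # e.g. "month3" -> ["", "3"]
    df = df.withColumn(
        "quaterly",
        (split_col.getItem(1).cast("integer") / (df["sequence_change"] + lit(1))).cast("integer") + lit(1),
    ).orderBy("l_id", "week")

    # For "month3" with sequence_change = 2: int(3 / (2 + 1)) + 1 = 1 + 1 = 2.
    df.show()

    One caveat if your real data has two-digit month numbers (e.g. "month10"): ordering by the raw week string sorts lexicographically, not numerically, so you may want to order by the extracted integer instead.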