Tags: dataframe, pyspark, categorical-data

how to create & sort by an ordered categorical variable in pyspark


I'm migrating some code from pandas to pyspark. My source dataframe looks like this:

   a         b  c
0  1    insert  1
1  2    update  1
2  3      seed  1
3  4    insert  2
4  5    update  2
5  6    delete  2
6  7  snapshot  1

and the operation (in python / pandas) that I'm applying is:

df.b = pd.Categorical(df.b, ordered=True, categories=['insert', 'seed', 'update', 'snapshot', 'delete'])    
df.sort_values(['c', 'b'])

resulting in the output dataframe:

   a         b  c
0  1    insert  1
2  3      seed  1
1  2    update  1
6  7  snapshot  1
3  4    insert  2
4  5    update  2
5  6    delete  2

I'm unsure how best to set up ordered categoricals in PySpark. My initial approach creates a precedence column using case/when and then sorts on it (the intended follow-up is sketched after the snippet):

from pyspark.sql.functions import when, col

df = df.withColumn(
    "_precedence",
    when(col("b") == "insert", 1)
    .when(col("b") == "seed", 2)
    .when(col("b") == "update", 3)
    .when(col("b") == "snapshot", 4)
    .when(col("b") == "delete", 5)
)
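
The follow-up I had in mind (a sketch only; the helper column is needed just for the sort) is something like:

# sort by c, then by the synthetic precedence, and drop the helper column afterwards
df = df.orderBy("c", "_precedence").drop("_precedence")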

Solution
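
(Setup, for reproducibility: a minimal sketch loading the sample data as a Spark DataFrame. Keeping the pandas index as an id column is an assumption here, chosen to match the output shown below.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data from the question, with the pandas index carried over as 'id'
df = spark.createDataFrame(
    [(0, 1, "insert", 1), (1, 2, "update", 1), (2, 3, "seed", 1),
     (3, 4, "insert", 2), (4, 5, "update", 2), (5, 6, "delete", 2),
     (6, 7, "snapshot", 1)],
    ["id", "a", "b", "c"],
)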

  • You can use a map:

    from pyspark.sql.functions import create_map, lit, col
    
    categories = ['insert', 'seed', 'update', 'snapshot', 'delete']
    
    # per @HaleemurAli, adjusted the below list comprehension to create the map;
    # keys must be wrapped in lit(), otherwise Spark treats the bare strings as column names
    map1 = create_map([val for (i, c) in enumerate(categories) for val in (lit(c), lit(i))])
    # Column<b'map(insert, 0, seed, 1, update, 2, snapshot, 3, delete, 4)'>
    
    df.orderBy('c', map1[col('b')]).show()
    +---+---+--------+---+
    | id|  a|       b|  c|
    +---+---+--------+---+
    |  0|  1|  insert|  1|
    |  2|  3|    seed|  1|
    |  1|  2|  update|  1|
    |  6|  7|snapshot|  1|
    |  3|  4|  insert|  2|
    |  4|  5|  update|  2|
    |  5|  6|  delete|  2|
    +---+---+--------+---+
    

    To reverse the order on column b: df.orderBy('c', map1[col('b')].desc()).show()
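
  • Note that any value of b not present in categories maps to NULL, which an ascending sort places first by default. If that can happen in your data, one option (a sketch, assuming Spark 2.4+ for Column.asc_nulls_last) is to push such rows to the end:

    # rows whose operation is missing from the map sort last rather than first
    df.orderBy('c', map1[col('b')].asc_nulls_last()).show()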