Search code examples
pyspark

Summarize low values into one


I have a table that looks something like this:

name value
optionA 123
optionB 50
optionC 32
optionD 4
optionE 2
optionF 1

I don't need the specifics about values lower that 10. So I just want to merge those values into one with the name "other" and the sum of those low values. Result would look like this:

name value
optionA 123
optionB 50
optionC 32
other 7

What's the easiest way to do that in databricks using pyspark?


Solution

  • Let us do

    name = F.expr("IF (value < 10, 'other', name)")
    df.withColumn('name', name).groupby('name').agg(F.sum('value').alias('value'))
    

    +-------+-----+
    |   name|value|
    +-------+-----+
    |optionA|  123|
    |optionB|   50|
    |optionC|   32|
    |  other|    7|
    +-------+-----+