Tags: python, apache-spark, dataframe, pyspark, apache-spark-sql

PySpark: Replacing values in a column by searching a dictionary


I'm new to PySpark.

I have a Spark DataFrame df that has a column 'device_type'.

I want to replace every value that is either "Tablet" or "Phone" with "Mobile", and replace "PC" with "Desktop".

In Python (using pandas) I can do the following:

deviceDict = {'Tablet': 'Mobile', 'Phone': 'Mobile', 'PC': 'Desktop'}
df['device_type'] = df['device_type'].replace(deviceDict, inplace=False)

How can I achieve this using PySpark? Thanks!


Solution

  • You can use either na.replace:

    df = spark.createDataFrame([
        ('Tablet', ), ('Phone', ),  ('PC', ), ('Other', ), (None, )
    ], ["device_type"])
    
    df.na.replace(deviceDict, 1).show()
    
    +-----------+
    |device_type|
    +-----------+
    |     Mobile|
    |     Mobile|
    |    Desktop|
    |      Other|
    |       null|
    +-----------+
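
    Note that when to_replace is a dictionary, the second argument is ignored, and in recent Spark versions it can be omitted entirely; DataFrame.replace behaves the same way as na.replace. A minimal sketch, assuming the df and deviceDict defined above:

    # same replacement, without the placeholder value argument
    df.na.replace(deviceDict).show()

    # DataFrame.replace is equivalent; subset restricts the replacement to the given column(s)
    df.replace(deviceDict, subset=['device_type']).show()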
    

    or a map literal:

    from itertools import chain
    from pyspark.sql.functions import create_map, lit

    # flatten the dict into an alternating key, value sequence of literal columns
    mapping = create_map([lit(x) for x in chain(*deviceDict.items())])

    df.select(mapping[df['device_type']].alias('device_type')).show()
    
    +-----------+
    |device_type|
    +-----------+
    |     Mobile|
    |     Mobile|
    |    Desktop|
    |       null|
    |       null|
    +-----------+
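
    For reference, chain(*deviceDict.items()) simply flattens the dictionary into an alternating key/value sequence, so the create_map call above is equivalent to spelling the literals out by hand (a sketch using the same deviceDict):

    # explicit equivalent of the create_map call built from deviceDict
    mapping = create_map(
        lit('Tablet'), lit('Mobile'),
        lit('Phone'), lit('Mobile'),
        lit('PC'), lit('Desktop')
    )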
    

    Please note that the latter solution converts values not present in the mapping to NULL. If that is not the desired behavior, you can add coalesce:

    from pyspark.sql.functions import coalesce

    df.select(
        coalesce(mapping[df['device_type']], df['device_type']).alias('device_type')
    ).show()
    
    +-----------+
    |device_type|
    +-----------+
    |     Mobile|
    |     Mobile|
    |    Desktop|
    |      Other|
    |       null|
    +-----------+
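
    To overwrite the column in place, which is the closest equivalent of the pandas assignment in the question, the same expression can be written back with withColumn. A minimal sketch, assuming the mapping column defined above:

    # replace device_type with its mapped value, keeping unmapped values unchanged
    df = df.withColumn(
        'device_type',
        coalesce(mapping[df['device_type']], df['device_type'])
    )
    df.show()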