python pyspark regexp-replace

regexp_replace / not contains certain strings


I am trying to make a new column by replacing certain strings.

df = df.withColumn('new_key_location', F.regexp_replace('key_location'), ^'Seoul|Jeju|Busan', 'Others')

I want every value in the key_location column that does not contain 'Seoul', 'Jeju', or 'Busan' to be changed to 'Others'.

What can I do?


Solution

  • You could use .isin (exact match) or .like (substring match) inside a when().otherwise() instead of regexp_replace.

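    from pyspark.sql import functions as func

    reqd_cities = ['Seoul', 'Jeju', 'Busan']
    
    # isin() keeps only exact matches; every other value becomes 'Others'.
    # Note that otherwise() is chained on when(), not on the column.
    data_sdf. \
        withColumn('new_key_location', 
                   func.when(func.col('key_location').isin(reqd_cities), func.col('key_location')).
                   otherwise(func.lit('Others'))
                   )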
    

    OR

    data_sdf. \
        withColumn('new_key_location', 
                   func.when(func.col('key_location').like('%Seoul%') | 
                             func.col('key_location').like('%Jeju%') |
                             func.col('key_location').like('%Busan%'), func.col('key_location')).
                   otherwise(func.lit('Others'))
                   )
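
If regexp_replace is still preferred, "does not contain" can be expressed with a negative lookahead instead of a leading ^. The sketch below only checks the pattern with Python's re module, so it runs without a Spark session; the names NOT_KEY_CITY and bucket_location are illustrative and not part of the original question or answer.

```python
import re

# Matches the whole string only when none of the city names occurs anywhere in it.
# (?!...) is a negative lookahead; NOT_KEY_CITY is a hypothetical name for this sketch.
NOT_KEY_CITY = r'^(?!.*(?:Seoul|Jeju|Busan)).*$'


def bucket_location(value):
    """Return 'Others' unless value contains Seoul, Jeju, or Busan."""
    return re.sub(NOT_KEY_CITY, 'Others', value)


samples = ['Seoul', 'Busan Port', 'Incheon', 'Daegu']
result = [bucket_location(s) for s in samples]
print(result)  # ['Seoul', 'Busan Port', 'Others', 'Others']
```

Spark evaluates regexp_replace with Java regular expressions, which also support lookahead, so the same pattern string could likely be passed as func.regexp_replace('key_location', pattern, 'Others'). That is an assumption and is not exercised by the sketch above, so it is worth checking against a few rows of real data.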