Search code examples
pythondataframepysparkreplacerow

Pyspark find existing set of rows in a dataframe and replace it with values from another dataframe


I have a Pyspark dataframe_Old (dfo) as below:

Id neighbor_sid neighbor division
a1 1100 Naalehu Hawaii
a2 1101 key-west-fl Miami
a3 1102 lubbock Texas
a10 1202 bay-terraces California

I have a Pyspark dataframe_new (dfn) as below:

Id neighbor_sid neighbor division
a1 1100 Naalehu Hawaii
a2 1111 key-largo-fl Miami
a3 1103 grapevine Texas
a4 1115 meriden-ct Connecticut
a12 2002 east-louisville Kentucky

My objective is to find the Ids from dataframe_new in dataframe_old and replace them with the new values from dataframe_new

Final expected Pyspark dataframe updated - dataframe_old

Id neighbor_sid neighbor division
a1 1100 Naalehu Hawaii
a2 1111 key-largo-fl Miami
a3 1103 grapevine Texas
a4 1115 meriden-ct Connecticut
a10 1202 bay-terraces California
a12 2002 east-louisville Kentucky

My wrong attempt at solving it as it is comparing column wise instead of row

dfO.alias('a').join(dfN.alias('b'), on=['id'], how='left')\
    .select(
        'id',
        f.when(
            ~f.isnull(f.col('b.id')),
            f.col('b.id')
        ).otherwise(f.col('a.id')).alias('id'),
        'b.col_3'
    )\
    .union(dfO)\
    .dropDuplicates()\
    .sort('id')\
    .show()

Please help - would really appreciate any guidance!


Solution

  • We can do an outer join on the id fields and then use coalesce() to prioritize the fields from dfn.

    columns = ['id', 'neighbor_sid', 'neighbor', 'division']
    
    dfo. \
        join(dfn, 'id', 'outer'). \
        select(*['id'] + [func.coalesce(dfn[k], dfo[k]).alias(k) for k in columns if k != 'id']). \
        orderBy('id'). \
        show()
    
    # +---+------------+------------+-----------+
    # | id|neighbor_sid|    neighbor|   division|
    # +---+------------+------------+-----------+
    # | a1|        1100|     Naalehu|     Hawaii|
    # |a10|        1202|bay-terraces| California|
    # | a2|        1111|key-largo-fl|      Miami|
    # | a3|        1103|   grapevine|      Texas|
    # | a4|        1115|  meriden-ct|Connecticut|
    # +---+------------+------------+-----------+