Tags: python, apache-spark, pyspark, apache-spark-sql, isin

PySpark isin variable assignment


I have a PySpark DataFrame with 50k records (dfa) and another with 40k records (dfb). In dfa, I want to create a new column that tags records also present in dfb as 'present' and everything else as 'not_present'.

I know pandas has syntax for this but I'm having trouble finding the PySpark syntax.

Input: dfa

col1 col2
xyz row
abc row
def row

dfb

col1 col2
xyz row
abc row

Expected Output:

df3

col1 col2 col3
xyz row present
abc row present
def row not_pre

Solution

  • df3 = df1.join(df2.select('col1', F.lit('present').alias('col3')).distinct(), 'col1', 'left')
    df3 = df3.fillna('not_pre', 'col3')
    

    Full example:

    from pyspark.sql import functions as F
    
    df1 = spark.createDataFrame(
        [('xyz', 'row'),
         ('abc', 'row'),
         ('def', 'row')],
        ['col1', 'col2']
    )
    df2 = spark.createDataFrame(
        [('xyz', 'row'),
         ('abc', 'row')],
        ['col1', 'col2']
    )
    
    # Tag df2's distinct keys with 'present' and left-join on col1 (distinct avoids duplicating df1 rows)
    df3 = df1.join(df2.select('col1', F.lit('present').alias('col3')).distinct(), 'col1', 'left')
    # Non-matching rows come back with col3 = null; fill them with 'not_pre'
    df3 = df3.fillna('not_pre', 'col3')
    
    df3.show()
    # +----+----+-------+
    # |col1|col2|   col3|
    # +----+----+-------+
    # | xyz| row|present|
    # | abc| row|present|
    # | def| row|not_pre|
    # +----+----+-------+
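
    The .distinct() on df2's keys matters: without it, repeated col1 values in df2 would duplicate rows of df1 after the join.

    Since the question title mentions isin, here is an alternative sketch using it. It assumes df2's distinct col1 values are small enough to collect to the driver (the ~40k keys here are fine) and reuses df1, df2, and the F import from above:

    # Collect the distinct join keys from df2 into a driver-side list
    keys = [r['col1'] for r in df2.select('col1').distinct().collect()]

    df3 = df1.withColumn(
        'col3',
        F.when(F.col('col1').isin(keys), 'present').otherwise('not_pre')
    )

    For large key sets, prefer the join above: isin embeds the whole list into the query plan.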