I have a PySpark DataFrame with 50k records (df1) and another with 40k records (df2). In df1, I want to create a new column that tags records whose key also appears in df2 as 'present', else 'not_present'.
I know pandas has syntax for this, but I'm having trouble finding the PySpark equivalent.
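For comparison, the pandas approach I had in mind is a left merge with `indicator=True` (a sketch using the same toy data as below; the `_merge` column name is what pandas produces by default):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['xyz', 'abc', 'def'], 'col2': ['row'] * 3})
df2 = pd.DataFrame({'col1': ['xyz', 'abc'], 'col2': ['row'] * 2})

# indicator=True adds a '_merge' column: 'both' for matched keys,
# 'left_only' for rows found only in df1
merged = df1.merge(df2[['col1']].drop_duplicates(), on='col1',
                   how='left', indicator=True)
merged['col3'] = merged['_merge'].map({'both': 'present',
                                       'left_only': 'not_present'})
merged = merged.drop(columns='_merge')
print(merged)
```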
Input: df1

| col1 | col2 |
|---|---|
| xyz | row |
| abc | row |
| def | row |
df2

| col1 | col2 |
|---|---|
| xyz | row |
| abc | row |
Expected Output: df3

| col1 | col2 | col3 |
|---|---|---|
| xyz | row | present |
| abc | row | present |
| def | row | not_present |
Left-join df1 to df2's key column with a literal 'present' tag, then fill the resulting nulls. The `.distinct()` matters: without it, duplicate keys in df2 would multiply rows in the join result.

df3 = df1.join(df2.select('col1', F.lit('present').alias('col3')).distinct(), 'col1', 'left')
df3 = df3.fillna('not_present', 'col3')
Full example:
from pyspark.sql import functions as F
df1 = spark.createDataFrame(
[('xyz', 'row'),
('abc', 'row'),
('def', 'row')],
['col1', 'col2']
)
df2 = spark.createDataFrame(
[('xyz', 'row'),
('abc', 'row')],
['col1', 'col2']
)
df3 = df1.join(df2.select('col1', F.lit('present').alias('col3')).distinct(), 'col1', 'left')
df3 = df3.fillna('not_present', 'col3')
df3.show()
# +----+----+-----------+
# |col1|col2|       col3|
# +----+----+-----------+
# | xyz| row|    present|
# | abc| row|    present|
# | def| row|not_present|
# +----+----+-----------+