
Adding columns to a Spark dataframe conditionally


I have a Spark dataframe like the one below:

[image: the original dataframe, with columns "codes" and "amount"]

Also, another dataframe:

[image: the exclusion dataframe, with a single column "exclusioncode"]

The output that I expect is below. I need two columns added conditionally to the original dataframe, where:
"offeramount1" = 75% of (amount)
"offeramount2" = 65% of (amount)

These offers should only be given when the code does not appear in the "exclusioncode" column.

[image: the expected output, with columns "codes", "amount", "offeramount1" and "offeramount2"; excluded codes show "no offer"]

I am able to add the columns without any issues using withColumn, but I'm unable to compare the two dataframes properly to apply the exclusion.


Solution

  • You could do two joins: first a 'leftanti' join to drop the excluded codes, then a 'right' join to bring those rows back with null offers:

    Input:

    from pyspark.sql import functions as F
    df1 = spark.createDataFrame(
        [('c1', 10), ('c2', 12), ('c3', 14), ('c4', 16), ('c5', 18),
         ('c6', 20), ('c7', 22), ('c8', 24), ('c9', 26), ('c10', 28)],
        ['codes', 'amount'])
    df2 = spark.createDataFrame([('c3',), ('c4',), ('c9',)], ['exclusioncode'])
    

    Script:

    df = (df1
        # 'leftanti' keeps only the rows whose code is NOT in the exclusion list
        .join(df2, df1.codes == df2.exclusioncode, 'leftanti')
        # compute the offer columns for the remaining (non-excluded) rows
        .withColumn('offeramount1', F.round(F.col('amount') * .75, 1))
        .withColumn('offeramount2', F.round(F.col('amount') * .65, 1))
        # the 'right' join back to df1 restores excluded rows with null offers
        .join(df1, ['codes', 'amount'], 'right')
    )
    df.show()
    # +-----+------+------------+------------+
    # |codes|amount|offeramount1|offeramount2|
    # +-----+------+------------+------------+
    # |   c3|    14|        null|        null|
    # |   c1|    10|         7.5|         6.5|
    # |   c4|    16|        null|        null|
    # |   c5|    18|        13.5|        11.7|
    # |   c2|    12|         9.0|         7.8|
    # |   c6|    20|        15.0|        13.0|
    # |  c10|    28|        21.0|        18.2|
    # |   c8|    24|        18.0|        15.6|
    # |   c7|    22|        16.5|        14.3|
    # |   c9|    26|        null|        null|
    # +-----+------+------------+------------+
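
    An equivalent single join would also work (a sketch, not from the original answer): left-join the exclusion list and use F.when to compute the offers only for non-excluded codes, assuming the same df1 and df2 as above:

    df_alt = (df1
        .join(df2, df1.codes == df2.exclusioncode, 'left')
        # exclusioncode is null only for codes absent from the exclusion list,
        # so F.when leaves the offers null for excluded codes
        .withColumn('offeramount1',
                    F.when(F.col('exclusioncode').isNull(),
                           F.round(F.col('amount') * .75, 1)))
        .withColumn('offeramount2',
                    F.when(F.col('exclusioncode').isNull(),
                           F.round(F.col('amount') * .65, 1)))
        .drop('exclusioncode')
    )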
    

    You cannot fill the null values with "no offer" directly, because a column in Spark holds a single data type: "no offer" is a string, while the other values are numeric.
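
    If you do want the literal "no offer" text (a sketch, assuming string columns are acceptable), cast the offer columns to string first and then fill the nulls:

    df_str = (df
        # once the columns are strings, fillna can substitute the text value
        .withColumn('offeramount1', F.col('offeramount1').cast('string'))
        .withColumn('offeramount2', F.col('offeramount2').cast('string'))
        .fillna('no offer', subset=['offeramount1', 'offeramount2'])
    )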