
Adding columns to a Spark dataframe conditionally


I have a Spark dataframe like the one below:

[image: the original dataframe, with columns "codes" and "amount"]

Also, another dataframe:

[image: the exclusion dataframe, with a single column "exclusioncode"]

The output that I expect is below. I need two columns added conditionally to the original dataframe, where:
"offeramount1" = 75% of (amount)
"offeramount2" = 65% of (amount)

These offers should only be given when the code does not appear in the "exclusioncode" column.

[image: the expected output, with columns "codes", "amount", "offeramount1" and "offeramount2"; excluded codes show "no offer"]

I am able to add the columns without any issues using withColumn, but I'm unable to compare the two dataframes properly to apply the exclusion.


Solution

  • You could do two joins: first a 'leftanti' join to drop the excluded codes, then a 'right' join to bring those rows back with null offers:

    Input:

    from pyspark.sql import functions as F
    df1 = spark.createDataFrame(
        [('c1', 10), ('c2', 12), ('c3', 14), ('c4', 16), ('c5', 18),
         ('c6', 20), ('c7', 22), ('c8', 24), ('c9', 26), ('c10', 28)],
        ['codes', 'amount'])
    df2 = spark.createDataFrame([('c3',), ('c4',), ('c9',)], ['exclusioncode'])
    

    Script:

    df = (df1
        # 'leftanti' keeps only the rows whose code is NOT in the exclusion list
        .join(df2, df1.codes == df2.exclusioncode, 'leftanti')
        # compute the offer columns for the remaining (non-excluded) rows
        .withColumn('offeramount1', F.round(F.col('amount') * .75, 1))
        .withColumn('offeramount2', F.round(F.col('amount') * .65, 1))
        # the 'right' join back to df1 restores excluded rows with null offers
        .join(df1, ['codes', 'amount'], 'right')
    )
    df.show()
    # +-----+------+------------+------------+
    # |codes|amount|offeramount1|offeramount2|
    # +-----+------+------------+------------+
    # |   c3|    14|        null|        null|
    # |   c1|    10|         7.5|         6.5|
    # |   c4|    16|        null|        null|
    # |   c5|    18|        13.5|        11.7|
    # |   c2|    12|         9.0|         7.8|
    # |   c6|    20|        15.0|        13.0|
    # |  c10|    28|        21.0|        18.2|
    # |   c8|    24|        18.0|        15.6|
    # |   c7|    22|        16.5|        14.3|
    # |   c9|    26|        null|        null|
    # +-----+------+------------+------------+
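
    An equivalent single join would also work (a sketch, not from the original answer): left-join the exclusion list and use F.when to compute the offers only for non-excluded codes, assuming the same df1 and df2 as above:

    df_alt = (df1
        .join(df2, df1.codes == df2.exclusioncode, 'left')
        # exclusioncode is null only for codes absent from the exclusion list,
        # so F.when leaves the offers null for excluded codes
        .withColumn('offeramount1',
                    F.when(F.col('exclusioncode').isNull(),
                           F.round(F.col('amount') * .75, 1)))
        .withColumn('offeramount2',
                    F.when(F.col('exclusioncode').isNull(),
                           F.round(F.col('amount') * .65, 1)))
        .drop('exclusioncode')
    )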
    

    You cannot fill the null values with "no offer" directly, because a column in Spark holds a single data type: "no offer" is a string, while the other values are numeric.
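
    If you do want the literal "no offer" text (a sketch, assuming string columns are acceptable), cast the offer columns to string first and then fill the nulls:

    df_str = (df
        # once the columns are strings, fillna can substitute the text value
        .withColumn('offeramount1', F.col('offeramount1').cast('string'))
        .withColumn('offeramount2', F.col('offeramount2').cast('string'))
        .fillna('no offer', subset=['offeramount1', 'offeramount2'])
    )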