python dataframe pyspark apache-spark-sql conditional-statements

How to change specific value in a row, using dictionary and conditions?

I would like to change value in one specific row based on value from other columns in the same row. I tried with method .withColumn() but it doesn't work as I wanted. What I want to get - set flag 1 when the object has date in dataframe just like in the given dictionary:

objects_ = {'x': [datetime.date(2010, 1, 1), datetime.date(2012, 1, 9), datetime.date(2012, 11, 1)], 'y': [datetime.date(2010, 5, 23), datetime.date(2002, 4, 3)]}

date	flag	obj
2010-01-01	0	x
2010-02-03	0	x
2010-02-04	0	x
2010-05-23	0	y
2010-10-13	0	y

date	flag	obj
2010-01-01	1	x
2010-02-03	0	x
2010-02-04	0	x
2010-05-23	1	y
2010-10-13	0	y

Dates are datetime.date type and dataframe is pyspark's.

Flag for date 2010-05-23 would be 1 only for object y (not for x too) as its listed for y key in dictionary.

I've made something like this, but I failed. Perhaps, there is some method to avoid the loops?

for x in x_list:
    for y in x_dict[x]:
        df = df.withColumn("flag", when(df["date"] == y, 1).otherwise(df["flag"]))

Solution

you can create case whens for each of the keys and then pass them to a coalesce.

here's the approach

mapper = [func.when((func.col('obj') == k) & func.col('date').isin(v), 1) for k, v in objects_.items()]
# [Column<'CASE WHEN ((obj = x) AND (date IN (DATE '2010-01-01', DATE '2012-01-09', DATE '2012-11-01'))) THEN 1 ELSE obj END'>,
#  Column<'CASE WHEN ((obj = y) AND (date IN (DATE '2010-05-23', DATE '2002-04-03'))) THEN 1 ELSE obj END'>]

data_sdf. \
    withColumn('flag_new', func.coalesce(*mapper, 'flag')). \
    show()

# +----------+----+---+--------+
# |      date|flag|obj|flag_new|
# +----------+----+---+--------+
# |2010-01-01|   0|  x|       1|
# |2010-02-03|   0|  x|       0|
# |2010-02-04|   0|  x|       0|
# |2010-05-23|   0|  y|       1|
# |2010-10-13|   0|  y|       0|
# +----------+----+---+--------+