I would like to change value in one specific row based on value from other columns in the same row. I tried with method .withColumn()
but it doesn't work as I wanted. What I want to get - set flag 1
when the object has date in dataframe just like in the given dictionary:
objects_ = {'x': [datetime.date(2010, 1, 1), datetime.date(2012, 1, 9), datetime.date(2012, 11, 1)], 'y': [datetime.date(2010, 5, 23), datetime.date(2002, 4, 3)]}
date | flag | obj |
---|---|---|
2010-01-01 | 0 | x |
2010-02-03 | 0 | x |
2010-02-04 | 0 | x |
2010-05-23 | 0 | y |
2010-10-13 | 0 | y |
to
date | flag | obj |
---|---|---|
2010-01-01 | 1 | x |
2010-02-03 | 0 | x |
2010-02-04 | 0 | x |
2010-05-23 | 1 | y |
2010-10-13 | 0 | y |
Dates are datetime.date type and dataframe is pyspark's.
Flag for date 2010-05-23
would be 1
only for object y
(not for x
too) as its listed for y
key in dictionary.
I've made something like this, but I failed. Perhaps, there is some method to avoid the loops?
for x in x_list:
for y in x_dict[x]:
df = df.withColumn("flag", when(df["date"] == y, 1).otherwise(df["flag"]))
you can create case whens for each of the keys and then pass them to a coalesce
.
here's the approach
mapper = [func.when((func.col('obj') == k) & func.col('date').isin(v), 1) for k, v in objects_.items()]
# [Column<'CASE WHEN ((obj = x) AND (date IN (DATE '2010-01-01', DATE '2012-01-09', DATE '2012-11-01'))) THEN 1 ELSE obj END'>,
# Column<'CASE WHEN ((obj = y) AND (date IN (DATE '2010-05-23', DATE '2002-04-03'))) THEN 1 ELSE obj END'>]
data_sdf. \
withColumn('flag_new', func.coalesce(*mapper, 'flag')). \
show()
# +----------+----+---+--------+
# | date|flag|obj|flag_new|
# +----------+----+---+--------+
# |2010-01-01| 0| x| 1|
# |2010-02-03| 0| x| 0|
# |2010-02-04| 0| x| 0|
# |2010-05-23| 0| y| 1|
# |2010-10-13| 0| y| 0|
# +----------+----+---+--------+