I have a dataframe with numbered columns (1 to 12), and I need logic that combines withColumn and a for loop. Given this sample dataframe:
df = spark.createDataFrame(
[(100, 100, 'A', 'A'),
(1000, 200, 'A', 'A'),
(1000, 300, 'B', 'A'),
(1000, 1000, 'B', 'B')],
"days1 int, days2 int, code1 string, code2 string")
df.show()
+-----+-----+-----+-----+
|days1|days2|code1|code2|
+-----+-----+-----+-----+
| 100| 100| A| A|
| 1000| 200| A| A|
| 1000| 300| B| A|
| 1000| 1000| B| B|
+-----+-----+-----+-----+
I need to create a new column like so: if days(n) > 100, then new(n) = code(n), else 'X'
df.withColumn('new1', F.when(F.col('days1') > 100, F.col('code1')).otherwise('X')).show()
+-----+-----+-----+-----+----+
|days1|days2|code1|code2|new1|
+-----+-----+-----+-----+----+
| 100| 100| A| A| X|
| 1000| 200| A| A| A|
| 1000| 300| B| A| B|
| 1000| 1000| B| B| B|
+-----+-----+-----+-----+----+
I know there's a for loop with select, such as this:
df.select(*[F.col(f'days{i+1}') for i in range(2)], *[F.col(f'code{i+1}') for i in range(2)]).show()
+-----+-----+-----+-----+
|days1|days2|code1|code2|
+-----+-----+-----+-----+
| 100| 100| A| A|
| 1000| 200| A| A|
| 1000| 300| B| A|
| 1000| 1000| B| B|
+-----+-----+-----+-----+
I tried like this but getting an error. Is this possible?
df.withColumn(*[(f'new{i+1}', F.when(F.col(f'days{i+1}') > 100, F.col(f'code{i+1}')).otherwise('X'))
for i in range(2)])\
.select('days1', 'code1', 'new1')\
.show()
TypeError: col should be Column
The select function accepts an array of columns, however the withColumn has two paramters columnName and value. If you want to dynamically add columns do something like:
for i in range(2):
df = df.withColumn(f'new{i+1}', F.when(F.col(f'days{i+1}') > 100, F.col(f'code{i+1}')).otherwise('X'))