Search code examples
dataframeapache-sparkpysparkdatabricks

How can I combine Pyspark withColumn with for in range on dataframe?


I have a dataframe with numbered columns (1 to 12), and I need logic that combines withColumn and a for loop. Given this sample dataframe:

df = spark.createDataFrame(
[(100, 100, 'A', 'A'),
  (1000, 200, 'A', 'A'),
  (1000, 300, 'B', 'A'),
  (1000, 1000, 'B', 'B')],  
"days1 int, days2 int, code1 string, code2 string")

df.show()

+-----+-----+-----+-----+
|days1|days2|code1|code2|
+-----+-----+-----+-----+
|  100|  100|    A|    A|
| 1000|  200|    A|    A|
| 1000|  300|    B|    A|
| 1000| 1000|    B|    B|
+-----+-----+-----+-----+

I need to create a new column like so: if days(n) > 100, then new(n) = code(n), else 'X'

df.withColumn('new1', F.when(F.col('days1') > 100, F.col('code1')).otherwise('X')).show()

+-----+-----+-----+-----+----+
|days1|days2|code1|code2|new1|
+-----+-----+-----+-----+----+
|  100|  100|    A|    A|   X|
| 1000|  200|    A|    A|   A|
| 1000|  300|    B|    A|   B|
| 1000| 1000|    B|    B|   B|
+-----+-----+-----+-----+----+

I know there's a for loop with select, such as this:

df.select(*[F.col(f'days{i+1}') for i in range(2)], *[F.col(f'code{i+1}') for i in range(2)]).show()

+-----+-----+-----+-----+
|days1|days2|code1|code2|
+-----+-----+-----+-----+
|  100|  100|    A|    A|
| 1000|  200|    A|    A|
| 1000|  300|    B|    A|
| 1000| 1000|    B|    B|
+-----+-----+-----+-----+

I tried like this but getting an error. Is this possible?

    df.withColumn(*[(f'new{i+1}', F.when(F.col(f'days{i+1}') > 100, F.col(f'code{i+1}')).otherwise('X')) 
for i in range(2)])\
      .select('days1', 'code1', 'new1')\
      .show()
    
    TypeError: col should be Column

Solution

  • The select function accepts an array of columns, however the withColumn has two paramters columnName and value. If you want to dynamically add columns do something like:

    for i in range(2):
        df = df.withColumn(f'new{i+1}', F.when(F.col(f'days{i+1}') > 100, F.col(f'code{i+1}')).otherwise('X'))