python databricks spark-koalas

Databricks Koalas: Column Assignment Based on Another Column Value with a Lambda Function


Given a Koalas DataFrame:

import databricks.koalas as ks

df = ks.DataFrame({"high_risk": [0, 1, 0, 1, 1],
                   "medium_risk": [1, 0, 0, 0, 0]})

Running a lambda function to get a new column based on the existing column values:

df = df.assign(risk=lambda x: "High" if x.high_risk else ("Medium" if x.medium_risk else "Low"))
df
Out[72]: 
   high_risk  medium_risk  risk
0          0            1  High
4          1            0  High
1          1            0  High
2          0            0  High
3          1            0  High

Expected output:

       high_risk  medium_risk  risk
    0          0            1  Medium
    4          1            0  High
    1          1            0  High
    2          0            0  Low
    3          1            0  High

Why does this assign "High" to every row? The intent is to evaluate the condition on each row. Is the comparison being applied to the whole column instead?


Solution

  • Using assign with a lambda on a Koalas DataFrame does not seem straightforward to me. For your case, I would multiply the 'high_risk' column by 2, add the 'medium_risk' column, and then map the result: 2 becomes 'high' (because of the multiplication by 2), 1 becomes 'medium', and 0 becomes 'low':

    df = df.assign(risk= df.high_risk.mul(2).add(df.medium_risk)
                           .map({0:'low', 1:'medium', 2:'high'}))
    df
       high_risk  medium_risk    risk
    0          0            1  medium
    1          1            0    high
    2          0            0     low
    3          1            0    high
    4          1            0    high
    

    Note: this would fail if a row has 1 in both the high_risk and medium_risk columns, because the sum would then be 3, which has no entry in the mapping.