Search code examples
pythonpandasapache-sparkdatabricksspark-koalas

Databricks Koalas: use for loop to create new columns with conditions and dynamically name the new column based on the old column names


Example dataset:

kdf = ks.DataFrame({"power_1": [50, 100, 150, 120, 18], 
                   "power_2": [50, 150, 150, 120, 18],
                   "power_3": [60, 100, 150, 120, 18],
                   "power_4": [150, 90, 150, 120, 18],
                   "power_30": [50, 60, 150, 120, 18]
                   })

df = pd.DataFrame({"power_1": [50, 100, 150, 120, 18], 
                   "power_2": [50, 150, 150, 120, 18],
                   "power_3": [60, 100, 150, 120, 18],
                   "power_4": [150, 90, 150, 120, 18],
                   "power_30": [50, 60, 150, 120, 18]
                   })

I know how to do it in pandas. Below are my codes:

cols = df.filter(regex='power_').columns
for col in cols:
    df[col] = pd.to_numeric(df[col],errors='coerce')
    df[col+'_Status']= ['OFF' if x<100 or np.isnan(x) else 'ON' for x in df[col]]

I can create the new columns one by one in Koalas using:

kdf = kdf.assign(power_1_Status=(kdf['power_1'].gt(100)).astype(int).map({0:'OFF',1:'ON'}))

But I don't know how to do it for all power columns because my dataset is really large with 50+ power columns and 1000+ other columns. I am using Databricks. I don't want to write 50+ lines of codes for all the power columns. My problem here is that I don't know how to dynamically add "_Status" to my original column name "power_1" in for loop in Koalas. I tried for loop using similar pandas structure. Here is what I tried but failed.

for col in cols:
    kdf = kdf.assign(col+'Status'=(kdf[col].gt(100)).astype(int).map({0:'OFF',1:'ON'}))

Thanks


Solution

  • I had a look at Koala's documentation and used some functions there that are similar to pandas', so give this a whirl and see if it works:

     #select columns you are interested in into a separate dataframe
     filtered = df.filter(like='power_')
    
     #drop the filtered columns from df
     df = df.drop(filtered.columns,axis=1)
    
     #do some cleanup
     filtered = (filtered
                .astype(int)
                .gt(100)
                .applymap(lambda x: 'OFF' if x==0 else 'ON')
                .add_suffix('_STATUS'))
    
     #hook back to original dataframe
     pd.concat([df,filtered],axis=1)
    

    You might want to use a copy when creating the filtered dataframe; just a suggestion as I don't know what costs are associated with that on Koalas.

     power_1_STATUS power_2_STATUS  power_3_STATUS  power_4_STATUS  power_30_STATUS
    0   OFF              OFF              OFF            ON               OFF
    1   OFF               ON              OFF            OFF              OFF
    2   ON                ON               ON            ON               ON
    3   ON                ON               ON            ON               ON
    4   OFF              OFF              OFF            OFF              OFF