Search code examples
pythonone-hot-encoding

Replace one_hot_encoding 0/1 with values from other column


I have a data frame of transactions with duplicate users. I want to aggregate my data and build a data frame with one row for each user.

The original data frame looks like this:

id       usage     country    publisher      type_status
123      6.77         US          X          bookstore_declined
123      4.25         US          X          super_approved
123      88.7         US          X          bookstore_approved
123      5.6          US          X          pharmacies_approved
456      43.66        BZ          Y          pharmacies_approved
456      56.87        BZ          Y          super_approved
456      9.65         BZ          Y          bookstore_approved

I want to use one_hot_encoding on the type_status feature, but I want that instead of 0/1 in the new dummy columns the new columns will have the 'usage' value.

Here an example for what I'm looking for:

id   country  publisher     bookstore_declined  super_approved  bookstore_approved   
123    US        X            6.77                4.25             88.7
456    BZ        Y             0                  56.87            9.65

Here is my code: How can I replace 0/1 with the usage values?

df=pd.get_dummies(df,columns=['type_status'],drop_first=True)

Solution

  • Hi I init the data so i get a dataframe as well

    import pandas as pd
    
    test_df = {
        'id': [123,123,123,123,456,456,456],
        'usage' :[6.77,4.25,88.7,5.6,43.66,56.87,9.65],
        'country' : ['US','US','US','US','BZ','BZ','BZ'],
        'publisher' : ['x','x','x','x', 'y','y','y'],
        'type_status': ['bookstore_declined','super_approved','bookstore_approved','pharmacies_approved','pharmacies_approved', 'super_approved','bookstore_approved']
    }
    
    df = pd.DataFrame(test_df)
    
    df=pd.get_dummies(df,columns=['type_status'],drop_first=True)
    

    The result looks like yours

        id  usage country publisher  type_status_bookstore_declined  ...
    0  123   6.77      US         x                               1   
    1  123   4.25      US         x                               0   
    2  123  88.70      US         x                               0   
    

    According to this Stackoverflow answer you can multiply multiple columns with following command:

    df.update(df.iloc[:, 4:7].mul(df.usage, 0))
    

    remove the usage column:

    df = df.drop('usage', axis=1)
    

    Result looks like this print:

    id  country     publisher   type_status_bookstore_declined  type_status_pharmacies_approved     type_status_super_approved
    0   123     US  x   6.77    0.00    0.00
    1   123     US  x   0.00    0.00    4.25
    2   123     US  x   0.00    0.00    0.00
    3   123     US  x   0.00    5.60    0.00
    4   456     BZ  y   0.00    43.66   0.00
    5   456     BZ  y   0.00    0.00    56.87
    6   456     BZ  y   0.00    0.00    0.00