PANDAS - converting a column with lists as values to dummy variables


I'm working with a dataset of Airbnb listings. One of the columns is called amenisities and contains all of the amenities that the listing has to offer. Several examples:

[Internet, Wifi, Paid parking off premises]

[Internet, Wifi, Kitchen]

[Wifi, Smoking allowed, Heating]

I would like to replace this column with several binary columns, one for each kind of amenity. So one of them, for example, would be:

wifi --> 0,0,0,1,1,0,1,1,0,1,0,1 

I found a way to achieve this with for loops:

# amenities is assumed to be the list-valued column, e.g. df['amenisities']
all_amenities = []
for row in amenities:
    all_amenities += row

all_amenities = set(all_amenities)
for col in all_amenities:
    df[col] = 0

for i,amenities_of_listing in enumerate(amenities):
    for amenity in amenities_of_listing:
        df.loc[i,amenity] = 1

But this is taking forever to run. Can anyone think of a more efficient way to do this?


Solution

  • I believe you need MultiLabelBinarizer, which works well even on a large DataFrame:

    print (df)
                                       amenisities
    0  [Internet, Wifi, Paid parking off premises]
    1                    [Internet, Wifi, Kitchen]
    2             [Wifi, Smoking allowed, Heating]
    
    from sklearn.preprocessing import MultiLabelBinarizer
    
    mlb = MultiLabelBinarizer()
    # index=df.index keeps the rows aligned with df when its index is not 0..n-1
    df1 = pd.DataFrame(mlb.fit_transform(df['amenisities']), columns=mlb.classes_, index=df.index)
    print (df1)
       Heating  Internet  Kitchen  Paid parking off premises  Smoking allowed  \
    0        0         1        0                          1                0   
    1        0         1        1                          0                0   
    2        1         0        0                          0                1   
    
       Wifi  
    0     1  
    1     1  
    2     1
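
To actually replace the original column, as the question asks, one option is to join the binarized frame back onto df and drop the list column. This is a minimal sketch, assuming the column is named amenisities as in the printout above; the sample frame and the names dummies and result are illustrative only:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data mirroring the three listings shown above (illustrative only).
df = pd.DataFrame({
    "amenisities": [
        ["Internet", "Wifi", "Paid parking off premises"],
        ["Internet", "Wifi", "Kitchen"],
        ["Wifi", "Smoking allowed", "Heating"],
    ]
})

mlb = MultiLabelBinarizer()
dummies = pd.DataFrame(
    mlb.fit_transform(df["amenisities"]),
    columns=mlb.classes_,
    index=df.index,  # keep rows aligned even if df has a non-default index
)

# Drop the list column and attach the 0/1 columns in its place.
result = df.drop(columns="amenisities").join(dummies)
print(result)
```

MultiLabelBinarizer sorts the labels, so the new columns come out in alphabetical order. Any other columns in df are left untouched by the join.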