I'm working with a dataset of Airbnb listings. One of the columns is called amenisities and contains all of the amenities that a listing has to offer. Several examples:
[Internet, Wifi, Paid parking off premises]
[Internet, Wifi, Kitchen]
[Wifi, Smoking allowed, Heating]
I would like to replace this column with several binary columns, one for each kind of amenity. So one of them, for example, will be:
wifi --> 0,0,0,1,1,0,1,1,0,1,0,1
I found a way to achieve this with for loops:
all_amenities = []
for row in amenities:
    all_amenities += row
all_amenities = set(all_amenities)
for col in all_amenities:
    df[col] = 0
for i,amenities_of_listing in enumerate(amenities):
    for amenity in amenities_of_listing:
        df.loc[i,amenity] = 1
But this is taking forever to run. Can someone here think of a more efficient way to do this?
I believe you need MultiLabelBinarizer, which works well for a large DataFrame:
print (df)
                                   amenisities
0  [Internet, Wifi, Paid parking off premises]
1                    [Internet, Wifi, Kitchen]
2             [Wifi, Smoking allowed, Heating]
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df['amenisities']),columns=mlb.classes_)
print (df1)
   Heating  Internet  Kitchen  Paid parking off premises  Smoking allowed  \
0        0         1        0                          1                0
1        0         1        1                          0                0
2        1         0        0                          0                1

   Wifi
0     1
1     1
2     1
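
Since the question asks to replace the original column rather than build a separate DataFrame, here is a minimal sketch of one way to do that, assuming the column really holds Python lists. The listing_id column is hypothetical and only there to show that other columns are kept; passing index=df.index is my addition so that join() lines the rows up even when df does not have a default 0..n-1 index.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample rows from the question; 'listing_id' is a hypothetical extra column
df = pd.DataFrame({
    'listing_id': [101, 102, 103],
    'amenisities': [
        ['Internet', 'Wifi', 'Paid parking off premises'],
        ['Internet', 'Wifi', 'Kitchen'],
        ['Wifi', 'Smoking allowed', 'Heating'],
    ],
})

mlb = MultiLabelBinarizer()

# pop() removes the list column from df and returns it as a Series
encoded = pd.DataFrame(
    mlb.fit_transform(df.pop('amenisities')),
    columns=mlb.classes_,
    index=df.index,  # reuse the original row labels so join() aligns rows
)

# Attach the 0/1 columns next to the remaining columns
df = df.join(encoded)
print(df)
```

After this, df no longer has the amenisities column and has one 0/1 column per amenity instead. The new columns come out in sorted order because that is how MultiLabelBinarizer stores classes_.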