I have two different csv's which I read in two dataframes. I want to match the columns df1['building_type] with df2['model'] and append the corresponding row to the df1.
Dataframe 1:
data = [{'length': '34', 'width': '58.5', 'height': '60.2', 'building_type': ['concrete','wood','steel','laminate']},
{'length': '42', 'width': '33', 'height': '23', 'building_type': ['concrete_double','wood_double','steel_double']}]
df1 = pd.DataFrame(data)
print(df1)
Dataframe 2:
data2 = [{'type': 'A1', 'floor': '2', 'model': ['wood','laminate','concrete','steel']},
{'type': 'B3', 'floor': '4', 'model': ['wood_double','concrete_double','steel_double']}]
df2=pd.DataFrame(data2)
print(df2)
Final dataframe:
length width height building_type type floor
0 34 58.5 60.2 [concrete, wood, steel, laminate] A1 2
1 42 33 23 [concrete_double, wood_double, steel_double] B3 4
pd.merge
seems to be the necessary tool here but we'll need an immutable dtype. list
is mutable and cannot be joined on. We can convert list
(mutable) to tuple
or frozenset
, both of which are immutable and can be used to join on. Since the sample output shows that order doesn't matter, I went with frozenset
.
Here is the code:
import pandas as pd
data = [{'length': '34', 'width': '58.5', 'height': '60.2', 'building_type': ['concrete','wood','steel','laminate']},
{'length': '42', 'width': '33', 'height': '23', 'building_type': ['concrete_double','wood_double','steel_double']}]
df1 = pd.DataFrame(data)
print(df1)
data2 = [{'type': 'A1', 'floor': '2', 'model': ['wood','laminate','concrete','steel']},
{'type': 'B3', 'floor': '4', 'model': ['wood_double','concrete_double','steel_double']}]
df2=pd.DataFrame(data2)
print(df2)
# Note: Merge fails on mutable dtype
# pd.merge(df1, df2, left_on='building_type', right_on='model')
# Produces `TypeError: unhashable type: 'list'`
# Convert mutable type to immutable type and merge.
# `tuple` is best if order matters for you. I am assuming that the
# order doesn't matter based on the sample output, so `frozenset` is more
# appropriate.
df1['building_type'] = df1['building_type'].apply(frozenset)
df2['model'] = df2['model'].apply(frozenset)
# Now, merge. Note that since column names are different both
# 'building_type' and 'model' would be retained. You can remove one of them.
final_df = pd.merge(df1, df2, left_on='building_type', right_on='model')
final_df = final_df.drop(['model'], axis=1)
print(final_df)
Output on my machine:
length width height building_type
0 34 58.5 60.2 [concrete, wood, steel, laminate]
1 42 33 23 [concrete_double, wood_double, steel_double]
type floor model
0 A1 2 [wood, laminate, concrete, steel]
1 B3 4 [wood_double, concrete_double, steel_double]
length width height building_type type floor
0 34 58.5 60.2 (laminate, wood, steel, concrete) A1 2
1 42 33 23 (concrete_double, steel_double, wood_double) B3 4