Search code examples
pythonpandasdataframemergeconcatenation

match the dataframe columns with list values and append dataframe with matching rows


I have two different csv's which I read in two dataframes. I want to match the columns df1['building_type] with df2['model'] and append the corresponding row to the df1.

Dataframe 1:

data = [{'length': '34', 'width': '58.5', 'height': '60.2', 'building_type': ['concrete','wood','steel','laminate']},
       {'length': '42', 'width': '33', 'height': '23', 'building_type': ['concrete_double','wood_double','steel_double']}]
df1 = pd.DataFrame(data)

print(df1)

Dataframe 2:


data2 = [{'type': 'A1', 'floor': '2', 'model': ['wood','laminate','concrete','steel']},
       {'type': 'B3', 'floor': '4',  'model': ['wood_double','concrete_double','steel_double']}]
df2=pd.DataFrame(data2)
print(df2)

Final dataframe:

   length   width   height  building_type                                 type  floor
0   34      58.5    60.2   [concrete, wood, steel, laminate]              A1    2
1   42      33      23     [concrete_double, wood_double, steel_double]   B3    4

Solution

  • pd.merge seems to be the necessary tool here but we'll need an immutable dtype. list is mutable and cannot be joined on. We can convert list (mutable) to tuple or frozenset, both of which are immutable and can be used to join on. Since the sample output shows that order doesn't matter, I went with frozenset.

    Here is the code:

    import pandas as pd
    
    data = [{'length': '34', 'width': '58.5', 'height': '60.2', 'building_type': ['concrete','wood','steel','laminate']},
           {'length': '42', 'width': '33', 'height': '23', 'building_type': ['concrete_double','wood_double','steel_double']}]
    df1 = pd.DataFrame(data)
    print(df1)
    
    data2 = [{'type': 'A1', 'floor': '2', 'model': ['wood','laminate','concrete','steel']},
           {'type': 'B3', 'floor': '4',  'model': ['wood_double','concrete_double','steel_double']}]
    df2=pd.DataFrame(data2)
    print(df2)
    
    
    # Note: Merge fails on mutable dtype
    # pd.merge(df1, df2, left_on='building_type', right_on='model')
    # Produces `TypeError: unhashable type: 'list'`
    
    # Convert mutable type to immutable type and merge.
    # `tuple` is best if order matters for you. I am assuming that the
    # order doesn't matter based on the sample output, so `frozenset` is more
    # appropriate.
    df1['building_type'] = df1['building_type'].apply(frozenset)
    df2['model'] = df2['model'].apply(frozenset)
    
    # Now, merge. Note that since column names are different both
    # 'building_type' and 'model' would be retained. You can remove one of them.
    final_df = pd.merge(df1, df2, left_on='building_type', right_on='model')
    final_df = final_df.drop(['model'], axis=1)
    print(final_df)
    
    

    Output on my machine:

      length width height                                 building_type
    0     34  58.5   60.2             [concrete, wood, steel, laminate]
    1     42    33     23  [concrete_double, wood_double, steel_double]
      type floor                                         model
    0   A1     2             [wood, laminate, concrete, steel]
    1   B3     4  [wood_double, concrete_double, steel_double]
      length width height                                 building_type type floor
    0     34  58.5   60.2             (laminate, wood, steel, concrete)   A1     2
    1     42    33     23  (concrete_double, steel_double, wood_double)   B3     4