Search code examples
pythonmergekeymapping

How to merge 2 data frames with different keys


Can you please help on my case as below. I want to merge 2 dataset to get the expected results as below:

Df 1: A data frame summarises value of features of each ID as below:

id feature 1 feature 2
1 1.2 3.4
2 2.3 1.2
3 3.5 6

Df 2: Mapping tables to determine corresponding segment for each feature based on min/max thresholds of each features |Feature|Min |Max|segment| |--| --- | --- |---| |Feature 1 | 0 |1|1| |Feature 1 |1|2|2| |Feature 1 |2 |inf|3| |Feature 2 | 0 |4|1| |Feature 2 |4|5|2| |Feature |5 |inf|3|

Expected results: I want to merge df1 with mapping table in df2 to get corresponding segment

id feature 1 feature 2 feature 1 segment feature 2 segment
1 1.2 3.4 2 1
2 2.3 1.2 3 1
3 3.5 6 3 3

Thanks a lot for helps


Solution

  • import pandas as pd
    import numpy as np
    
    df1 = pd.DataFrame({'id': [1, 2, 3], 'feature 1': [1.2, 2.3, 3.5], 'feature 2': [3.4, 1.2, 6]})
    
    df2 = pd.DataFrame({'Feature': ['feature 1', 'feature 1', 'feature 1', 'feature 2', 'feature 2', 'feature 2'],
                        'Min': [0, 1, 2, 0, 4, 5], 'Max': [1, 2, np.inf, 4, 5, np.inf], 'segment': [1, 2, 3, 1, 2, 3]})
    
    
    df1 = df1.set_index('id')
    
    def func_data(x, q):
        df = df2[df2['Feature'] == q]
        ttt = np.where((x.values[0] >= df['Min']) & (x.values[0] <= df['Max']))
        index = df1.index[ttt][0]
    
        return index
    
    
    df1['feature_1_segment'] = df1.groupby(['id'])['feature 1'].apply(func_data, 'feature 1')
    df1['feature_2_segment'] = df1.groupby(['id'])['feature 2'].apply(func_data, 'feature 2')
    df1 = df1.reset_index()
    
    print(df1)
    

    Output

       id  feature 1  feature 2  feature_1_segment  feature_2_segment
    0   1        1.2        3.4                  2                  1
    1   2        2.3        1.2                  3                  1
    2   3        3.5        6.0                  3                  3
    

    Here, first the 'id' column is set as an index. A 'func_data ' function is created to determine which range each number falls into. The np.where function from numpy is used to test for a range. Get indexes by 'id'.