Search code examples
pythonpandasdataframe

Combine 2 tables and create new rows if needed


I have 2 tables with a column in common. I want to combine both tables but the common column doesn't have the same exact values in both tables.

Table1:

Rate      |  Age
0-4       |  <6.0
5-6       |  6.0
7-8       |  6.3
9-10      |  <6.6
11-12     |  6.9
13-14     |  7.0
15-16     |  7.3
17-18     |  7.6
19-20     |  7.9
21        |  8.0
22-23     |  8.3
24-25     |  8.6
26        |  8.9
27        |  9.0
28        |  9.3
29        |  9.6
30        |  9.9
31        |  10.0
32        |  10.3
33        |  10.6
34        |  10.9
35        |  11.0
36        |  11.3
37        |  11.6
38        |  11.9
39        |  12.0
40        |  12.3
41        |  12.6
42        |  12.9
43        |  13.0
44        |  13.3
45        |  13.6
46        |  13.9
47        |  14.6
48        |  15.0
49        |  15.3
50        |  15.6
51        |  16.0
52        |  16.6
53        |  17.0
54        |  17.3
55        |  17.6
56        |  18.0
57-80     | >18.0

Table2:

Rate  |  Grade
0-7   |  <1.0
7-8   |  1
9-11  |  1.2
12-13 |  1.4
14-16 |  1.7
17-18 |  2
19-20 |  2.2
21    |  2.4
22-23 |  2.7
24-25 |  3
26    |  3.2
27    |  3.4
28    |  3.7
29    |  4
30    |  4.2
31    |  4.4
32    |  4.7
33-34 |  5
35    |  5.2
36    |  5.4
37    |  5.7
38    |  6
39    |  6.2
40    |  6.4
41    |  6.7
42    |  7
43    |  7.4
44    |  7.7
45    |  8
46    |  8.2
47    |  8.4
48    |  9
49    |  9.2
50    |  9.4
51    |  10.2
52    |  10.4
53    |  11
54    |  11.4
55    |  12
56    |  12.4
57-80 |  13

The newly generated table would have 3 columns (Rate, Age, Grade).

I tried the following script:

import pandas as pd

# Read tables into DataFrames
table1 = pd.read_excel('table1.xlsx')
table2 = pd.read_excel('table2.xlsx')

# Merge tables on 'Rate' column with outer join
merged_table = pd.merge(table1, table2, on='Rate', how='outer')

# Print or save merged table
merged_table.to_excel('merged.xlsx')

The output of this script is not what was expected:

    Rate    Age     Grade
0   21      8       2.4
1   26      8.9     3.2
2   27      9       3.4
3   28      9.3     3.7
4   29      9.6     4
5   30      9.9     4.2
6   31      10      4.4
7   32      10.3    4.7
8   33      10.6    
9   34      10.9    
10  35      11      5.2
11  36      11.3    5.4
12  37      11.6    5.7
13  38      11.9    6
14  39      12      6.2
15  40      12.3    6.4
16  41      12.6    6.7
17  42      12.9    7
18  43      13      7.4
19  44      13.3    7.7
20  45      13.6    8
21  46      13.9    8.2
22  47      14.6    8.4
23  48      15      9
24  49      15.3    9.2
25  50      15.6    9.4
26  51      16      10.2
27  52      16.6    10.4
28  53      17      11
29  54      17.3    11.4
30  55      17.6    12
31  56      18      12.4
32  0-4     <6.0    
33  0-7     <1.0
34  11-12   6.9 
35  12-13           1.4
36  13-14   7   
37  14-16           1.7
38  15-16   7.3 
39  17-18   7.6     2
40  19-20   7.9     2.2
41  22-23   8.3     2.7
42  24-25   8.6     3
43  33-34           5
44  5-6 6   
45  56-80   >18.0   13
46  7-8     6.3     1
47  9-10    6.6 
48  9-11            1.2
49          14      7.2
50          14      8.7
51          14      9.7
52          14      10
53          14      10.7
54          14      11.2
55          14      11.7
56          14      12.2
57          14      12.7
58          14.3    7.2
59          14.3    8.7
60          14.3    9.7
61          14.3    10
62          14.3    10.7
63          14.3    11.2
64          14.3    11.7
65          14.3    12.2
66          14.3    12.7
67          14.9    7.2
68          14.9    8.7
69          14.9    9.7
70          14.9    10
71          14.9    10.7
72          14.9    11.2
73          14.9    11.7
74          14.9    12.2
75          14.9    12.7
76          15.9    7.2
77          15.9    8.7
78          15.9    9.7
79          15.9    10
80          15.9    10.7
81          15.9    11.2
82          15.9    11.7
83          15.9    12.2
84          15.9    12.7
85          16.3    7.2
86          16.3    8.7
87          16.3    9.7
88          16.3    10
89          16.3    10.7
90          16.3    11.2
91          16.3    11.7
92          16.3    12.2
93          16.3    12.7
94          16.9    7.2
95          16.9    8.7
96          16.9    9.7
97          16.9    10
98          16.9    10.7
99          16.9    11.2
100         16.9    11.7
101         16.9    12.2
102         16.9    12.7
103         17.9    7.2
104         17.9    8.7
105         17.9    9.7
106         17.9    10
107         17.9    10.7
108         17.9    11.2
109         17.9    11.7
110         17.9    12.2
111         17.9    12.7

The desired output:

Rate    |    Age     |    Grade
0-4     |    <6.0    |    <1.0
5-6     |    6.0     |    <1.0
7-8     |    6.3     |    1
..      |    ..      |    ..

Example for testing:

import pandas as pd

table1 = pd.DataFrame({'Rate': ['0-4', '5-6', '7-8', '9-10', '11-12', '13-14'], 
                        'Age': ['<6.0', '6.0', '6.3', '6.6', '6.9', '7.0']})
table2 = pd.DataFrame({'Rate': ['0-7', '7-8', '9-11', '12-13', '14-16', '17-18'], 
                        'Grade': ['<1.0', '1', '1.2', '1.4', '1.7', '2']})

merged_table = pd.merge(table1, table2, on='Rate', how='outer')
merged_table.sort_values('Rate').to_excel('merged1.xlsx')

The output from that example:

    Rate    Age     Grade
0   0-4     <6.0    
1   0-7             <1.0
2   11-12   6.9 
3   12-13           1.4
4   13-14   7.0 
5   14-16           1.7
6   17-18           2
7   5-6     6.0 
8   7-8     6.3     1
9   9-10    6.6 
10  9-11            1.2

The desired output of that example:

    Rate    Age     Grade
0   0-4     <6.0    <1.0
1   5-6     6.0     <1.0    
2   7-8     6.3     1       
9   9-10    6.6     1.2 
2   11-12   6.9     1.4
4   13-14   7.0     1.7
5   15-16   7.3     1.7
6   17-18   7.6     2

So each row has values for the 3 columns.

Is that possible with pandas or other modules/libraries?


Solution

  • Example

    Your MRE does not appear to be adequate. I changed the example somewhat. (In particular, the part where 0-7 and 7-8 coexist in df2 seems wrong to me).

    import pandas as pd
    df1 = pd.DataFrame({'Rate': ['0-4', '5-6', '7-8', '9-10', '11-13', '14-16', '17'], 'Age': ['<6.0', '6.0', '6.3', '6.6', '6.9', '7.0', '7.3']})
    df2 = pd.DataFrame({'Rate': ['0-6', '7-8', '9-10', '11-12', '13-14', '15-16', '17'], 'Grade': ['<1.0', '1', '1.2', '1.4', '1.7', '2.0', '2.3']})
    

    Code

    use IntervalIndex to set the interval and then index to the maximum value of the interval to get the result. (because the grade appears to be assigned to the maximum value in the range).

    tmp = df2['Rate'].str.split('-', expand=True).ffill(axis=1).astype('int')
    idx = pd.IntervalIndex.from_arrays(tmp[0], tmp[1], closed='both')
    m = df2['Grade'].set_axis(idx)
    
    tmp1 = df1['Rate'].str.extract('(\d+)$')[0].astype('int')
    out = df1.assign(Grade=m[tmp1].values)
    

    enter image description here