Search code examples
python-3.xpandasdataframenumpypython-itertools

Join two dataframes based on closest combination that sums up to a target value


Im trying to join below two dataframes based on closest combination of rows from df2 column Sales that sums up to target value in df1 column Total Sales, columns Name & Date in both dataframes should be the same when joining (as showed in expected output).

For Example : in df1 row number 0 should be matched only with df2 rows 0 & 1, since columns Name & Date is the same, Which is Name : John and Date : 2021-10-01.

df1 :

df1 = pd.DataFrame({"Name":{"0":"John","1":"John","2":"Jack","3":"Nancy","4":"Ahmed"},
                    "Date":{"0":"2021-10-01","1":"2021-11-01","2":"2021-10-10","3":"2021-10-12","4":"2021-10-30"},
                    "Total Sales":{"0":15500,"1":5500,"2":17600,"3":20700,"4":12000}})

    Name    Date        Total Sales
0   John    2021-10-01  15500
1   John    2021-11-01  5500
2   Jack    2021-10-10  17600
3   Nancy   2021-10-12  20700
4   Ahmed   2021-10-30  12000

df2 :

df2 = pd.DataFrame({"ID":{"0":"JO1","1":"JO2","2":"JO3","3":"JO4","4":"JA1","5":"JA2","6":"NA1",
                          "7":"NA2","8":"NA3","9":"NA4","10":"AH1","11":"AH2","12":"AH3","13":"AH3"},
                    "Name":{"0":"John","1":"John","2":"John","3":"John","4":"Jack","5":"Jack","6":"Nancy","7":"Nancy",
                            "8":"Nancy","9":"Nancy","10":"Ahmed","11":"Ahmed","12":"Ahmed","13":"Ahmed"},
                    "Date":{"0":"2021-10-01","1":"2021-10-01","2":"2021-11-01","3":"2021-11-01","4":"2021-10-10","5":"2021-10-10","6":"2021-10-12","7":"2021-10-12",
                            "8":"2021-10-12","9":"2021-10-12","10":"2021-10-30","11":"2021-10-30","12":"2021-10-30","13":"2021-10-29"},
                    "Sales":{"0":10000,"1":5000,"2":1000,"3":5500,"4":10000,"5":7000,"6":20000,
                             "7":100,"8":500,"9":100,"10":5000,"11":7000,"12":10000,"13":12000}})

    ID  Name    Date        Sales
0   JO1 John    2021-10-01  10000
1   JO2 John    2021-10-01  5000
2   JO3 John    2021-11-01  1000
3   JO4 John    2021-11-01  5500
4   JA1 Jack    2021-10-10  10000
5   JA2 Jack    2021-10-10  7000
6   NA1 Nancy   2021-10-12  20000
7   NA2 Nancy   2021-10-12  100
8   NA3 Nancy   2021-10-12  500
9   NA4 Nancy   2021-10-12  100
10  AH1 Ahmed   2021-10-30  5000
11  AH2 Ahmed   2021-10-30  7000
12  AH3 Ahmed   2021-10-30  10000
13  AH3 Ahmed   2021-10-29  12000

Expected Output :

    Name    Date        Total Sales Comb IDs            Comb Total
0   John    2021-10-01  15500       JO1, JO2            15000.0
1   John    2021-11-01  5500        JO4                 5500.0
2   Jack    2021-10-10  17600       JA1, JA2            17000.0
3   Nancy   2021-10-12  20700       NA1, NA2, NA3, NA4  20700.0
4   Ahmed   2021-10-30  12000       AH1, AH2            12000.0

What i have tried below is working for only one row at a time, but im not sure how to apply it in pandas dataframes to get the expected output.

Variable numbers in below script represent Sales column in df2, and variable target below represent Total Sales column in df1.

import itertools
import math

numbers = [1000, 5000, 3000]
target = 6000

best_combination = ((None,))
best_result = math.inf
best_sum = 0

for L in range(0, len(numbers)+1):
    for combination in itertools.combinations(numbers, L):
        sum = 0
        for number in combination:
            sum += number
        result = target - sum
        if abs(result) < abs(best_result):
            best_result = result
            best_combination = combination
            best_sum = sum

print("\nbest sum{} = {}".format(best_combination, best_sum))


[Out] best sum(1000, 5000) = 6000

Solution

  • Take the code you wrote which finds the best sum and turn it into a function (let's call it opt, which has parameters for target and a dataframe (which will be a subset of df2. It needs to return a list of IDs which correspond to the optimal combination.

    Write another function which takes 3 arguments name, date and target (let's call it calc). This function will filter df2 based on name and date, and pass it, along with the target to the opt function and return the result of that function. Finally, iterate through rows of df1, and call calc with the row arguments (or alternatively use pandas.DataFrame.apply