How to identify optimal item prices given a list of quantities and orders?

I have a set of orders where I have the total order value and item quantities but not the cost of individual items. I'm trying to back into what item prices would best explain these totals given the quantity for each order. Note that there are hundreds of items and thousands of orders but I've provided an example problem below.

So far I've tried creating a linear model to find weights but I think a direct optimization model may be a better approach so I've started using pulp to find item prices that best explain these totals but am running into some issues for formulating the optimization problem...

Order ID	Order Total	Item 1 Quantity	Item 2 Quantity	Item 3 Quantity
1	$15.00	1	0	2
2	$23.30	0	3	1
3	$17.50	2	0	0
4	$27.25	0	1	1

I'm creating LpVariables as:

itemPrices = [
    pulp.LpVariable(f"Item-{mi}", 0) for mi in sMenuItemIds
]


# Objective
prob += pulp.lpSum(
    itemVars
)

# ABS()
for _, frow in orderDataFrame.iterrows():
    prob += np.sum(frow[mi] * itemVars[sItemIndex[mi]] for mi in sMenuItemIds) >= frow['order_total']
    prob += np.sum(frow[mi] * itemVars[sItemIndex[mi]] for mi in sMenuItemIds) >=  -1 * frow['order_total']

When I run this problem all item values are set to $0.0 so I'm not accurately formulating the optimization problem. I'm trying to minimize the difference of SUM(Item Quantity * Item Price) and the order_total. Any suggestions on how to better approach this problem?

Solution

Here is a formulation and a few ideas that might help.

I popped your data into a .csv file and removed the "$" to make this work. Realize that the case you gave is infeasible (there are no EXACT prices that make those 4 equations work). The model should be formulated to allow for slack using the constraint as shown such that the expenditure is at least the Order Total and let the model try to minimize the prices.

I saw that you were thinking about ABS() for a difference maybe? I took the approach that the prices had to at least cover the order cost, so ABS() difference shouldn't need to be considered, unless there is more info you have.

I also monkeyed with an alternate objective (shown) to look at the "grand total" and minimize that, which comes out the same, and I think that is provably always the case, but I'd need to think on it... :)

I included a little checker at the end to see how well we did overall. If the records are legit/solvable, the model should come up with the same as the order total, it is a straightforward Linear Program.

Code:

import pandas as pd
import numpy as np
import pulp

df = pd.read_csv('pricer_data.csv')
df = df.set_index('Order ID')

prob = pulp.LpProblem('pricer')

# cheater for item ids... you could do something more creative here...
items = list(range(1,4))

# I find this easier (and clearer than making a list of vars...)
prices = pulp.LpVariable.dicts('price', items, lowBound=0)

# Objective
# This objective will give you the lowest sum of PRICES 
prob += pulp.lpSum(prices)

# lowest total cost objective
# prob += pulp.lpSum(df.iloc[r][f'Item {i} Quantity'] * prices[i] for i in items for r in range(len(df)))

# Constraints
for _, row in df.iterrows():
    prob += pulp.lpSum(row[f'Item {i} Quantity'] * prices[i] for i in items ) >= row['Order Total']

print(prob)
soln = prob.solve()
print(soln)
for item in prices.keys():
    print(f'item {item} price: {prices[item].varValue}')


# let's check the grand total...
grand_total = sum(df['Order Total'])

# make a quantity matrix
qty_matrix = df.reset_index().drop(columns=['Order ID', 'Order Total']).to_numpy()
print(qty_matrix)

price_array = np.array([prices[i].varValue for i in prices.keys()])
order_costs = qty_matrix.dot(price_array.T)
print(order_costs)

model_total_cost = np.sum(order_costs)
print(f'total from records: {grand_total}')
print(f'model fit costs: {model_total_cost}')

Output (unabridged... ):

pricer:
MINIMIZE
1*price_1 + 1*price_2 + 1*price_3 + 0
SUBJECT TO
_C1: price_1 + 2 price_3 >= 15

_C2: 3 price_2 + price_3 >= 23.3

_C3: 2 price_1 >= 17.5

_C4: price_2 + price_3 >= 27.25

VARIABLES
price_1 Continuous
price_2 Continuous
price_3 Continuous

Welcome to the CBC MILP Solver 
Version: 2.10.3 
Build Date: Dec 15 2019 

command line - /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pulp/solverdir/cbc/osx/64/cbc /var/folders/rt/5gc2md7s7y5bw8ddt74tz5vw0000gp/T/ec6ca9afa2b143fe845f38050ccae992-pulp.mps timeMode elapsed branch printingOptions all solution /var/folders/rt/5gc2md7s7y5bw8ddt74tz5vw0000gp/T/ec6ca9afa2b143fe845f38050ccae992-pulp.sol (default strategy 1)
At line 2 NAME          MODEL
At line 3 ROWS
At line 9 COLUMNS
At line 20 RHS
At line 25 BOUNDS
At line 26 ENDATA
Problem MODEL has 4 rows, 3 columns and 7 elements
Coin0008I MODEL read with 0 errors
Option for timeMode changed from cpu to elapsed
Presolve 3 (-1) rows, 3 (0) columns and 6 (-1) elements
0  Obj 8.75 Primal inf 38.141664 (3)
2  Obj 36
Optimal - objective value 36
After Postsolve, objective 36, infeasibilities - dual 0 (0), primal 0 (0)
Optimal objective 36 - 2 iterations time 0.002, Presolve 0.00
Option for printingOptions changed from normal to all
Total time (CPU seconds):       0.00   (Wallclock seconds):       0.00

1
item 1 price: 8.75
item 2 price: 24.125
item 3 price: 3.125
[[1 0 2]
 [0 3 1]
 [2 0 0]
 [0 1 1]]
[15.   75.5  17.5  27.25]
total from records: 83.05
model fit costs: 135.25