Search code examples
raggregateweighted-average

Aggregating by indeces and reweighting in R


I have a ton of Price data that is indexed by State, Date, and UPC (Product Code). I want to aggregate away UPC, and combine the Prices by taking a weighted average. I will try to explain it, but you may just want to read the code below.

Each observation in the dataset is: UPC, date, state, price, and weight. I would like to aggregate away the UPC index in this way:

Take all data points with the same date and state, and multiple their prices by their weights and sum them up. This obviously creates a weighted average, which I call the priceIndex. However, for some date&state combos the weights do not add up to 1. Therefore, I want to create two additional columns: One for the total of the weights for each date&state combo. The second is for a reweighted average: I.e., if the original two weights were .5 and .3, change them to .5/(.5+.3)=.625 and .3/(.5+.3)=.375, and then recalculate the weighted average into another price index.

This is what I mean:

upc=c(1153801013,1153801013,1153801013,1153801013,1153801013,1153801013,2105900750,2105900750,2105900750,2105900750,2105900750,2173300001,2173300001,2173300001,2173300001)
date=c(200601,200602,200603,200603,200601,200602,200601,200602,200603,200601,200602,200601,200602,200603,200601)
price=c(26,28,27,27,23,24,85,84,79.5,81,78,24,19,98,47)
state=c(1,1,1,2,2,2,1,1,2,2,2,1,1,1,2)
weight=c(.3,.2,.6,.4,.4,.5,.5,.5,.45,.15,.5,.2,.15,.3,.45)

# This is what I have:
data <- data.frame(upc,date,state,price,weight)
data

# These are a few of the weighted calculations:
# .3*26+85*.5+24*.2 = 55.1
# 28*.2+84*.5+19*.15 = 50.45
# 27*.6+98*.3 = 45.6
# Etc. etc.

# Here is the reweighted calculation for date=200602 & state==1:
# 28*(.2/.85)+84*(.5/.85)+19*(.15/.85) = 50.45
# Or, equivalently:
# (28*.2+84*.5+19*.15)/.85 = 50.45

# This is what I want:
date=c(200601,200602,200603,200601,200602,200603)
state=c(1,1,1,2,2,2)
priceIndex=c(55.1,50.45,45.6,42.5,51,46.575)
totalWeight=c(1,.85,.9,1,1,.85)
reweightedIndex=c(55.1,59.35294,50.66667,42.5,51,54.79412)
index <- data.frame(date,state,priceIndex,totalWeight,reweightedIndex)
index

Also, not that it should matter, but there are about 35 states, 150 UPCs, and 84 dates in the dataset -- so there are a lot of observations.

Thanks a lot in advance.


Solution

  • We can use one of the group by summarise operation. With data.table, we convert the 'data.frame' to 'data.table' (setDT(data), grouped by 'date', 'state', we get the sum of product of 'price' and 'weight', and sum(weight) as temporary variables, and then create the 3 variables in the list based on that.

    library(data.table) 
    setDT(data)[, {tmp1 = sum(price*weight)
                    tmp2 = sum(weight)
            list(priceIndex=tmp1, totalWeight=tmp2,
                  reweigthedIndex = tmp1/tmp2)}, .(date, state)]
    #    date state priceIndex totalWeight reweightedIndex
    #1: 200601     1     55.100        1.00        55.10000
    #2: 200602     1     50.450        0.85        59.35294
    #3: 200603     1     45.600        0.90        50.66667
    #4: 200603     2     46.575        0.85        54.79412
    #5: 200601     2     42.500        1.00        42.50000
    #6: 200602     2     51.000        1.00        51.00000
    

    Or using dplyr, we can use summarise to create the 3 columns after doing grouping by 'date' and 'state'.

    library(dplyr)
    data %>% 
      group_by(date, state) %>% 
      summarise(priceIndex = sum(price*weight),
                totalWeight = sum(weight),
                reweightedIndex = priceIndex/totalWeight)
    #   date state priceIndex totalWeight reweightedIndex
    #   (dbl) (dbl)      (dbl)       (dbl)           (dbl)
    #1 200601     1     55.100        1.00        55.10000
    #2 200601     2     42.500        1.00        42.50000
    #3 200602     1     50.450        0.85        59.35294
    #4 200602     2     51.000        1.00        51.00000
    #5 200603     1     45.600        0.90        50.66667
    #6 200603     2     46.575        0.85        54.79412