Search code examples
rvariance

Calculate aggregated variations from expected values (not std deviations)


I am am trying to use it to model the pricing accuracy of a group of sellers in a network.

My data set (pricing) looks like this:

  transactionID sellerID expectedprice actualprice pricediff
1          1001      251           200         210        10
2          1002      101           200         300       100
3          1003      251           400         190      -210
4          1004      251           300         300         0
5          1005      101           250         250         0
6          1006      350           200         210        10
7          1007      401           400         400         0

Note: I am not trying to do a standard deviation calculation since I am not trying to calculate variance from the mean, but rather variance from the expected value column which will differ depending on the transaction.

I am comfortable inserting new columns to get absolute variances from the expected value into the table using:

pricing$diffabs <- abs(pricing$pricediff)

which results in the following:

transactionID   sellerID    expectedprice   actualprice pricediff   diffabs
1001            251         200             210         10          10
1002            101         200             300         100         100
1003            251         400             190         -210            210
1004            251         300             300         0           0
1005            101         250             250         0           0
1006            350         200             210         10          10
1007            401         400             400         0           0

How does one then calculate a variance score for each seller which would be:

the sum of abs(pricing$diff) grouped at the "sellerID" divided by the number of observations (count) of "sellerID" in the data.

The output I would expect to be would be the following:

SellerID    Count   Sumofdiffabs    Variation
251         3       220             73.33333333
101         2       100             50
350         1       10              10
401         1       0               0

The other help topics that deal with variances in R at an aggregated level seem to only deal with standard deviation or variances from mean, such as this:

Calculating grouped variance from a frequency table in R

The aggregate function works well for me when using a simple function like standard deviation, but not where I have to figure out how to insert a count into function. What is throwing me off, is that my variance is a deviation not from the mean, but from a column result in my table.


Solution

  • m =  matrix(c(1001,251,200,210,10,1002,101,200,300,100,1003,251,400,190,-210,1004,251,300,300,0,1005,101,250,250,0,1006,350,200,210,10,1007,401,400,400,0),ncol = 5,nrow=7,byrow=TRUE)
    colnames(m) = c("transactionID","sellerID","expectedprice","actualprice","pricediff")
    pricing = as.data.frame(m)
    pricing$diffabs <- abs(pricing$pricediff)
    pricing
    
      transactionID sellerID expectedprice actualprice pricediff diffabs
               1001      251           200         210        10      10
               1002      101           200         300       100     100
               1003      251           400         190      -210     210
               1004      251           300         300         0       0
               1005      101           250         250         0       0
               1006      350           200         210        10      10
               1007      401           400         400         0       0
    

    So here is the result:

    library(data.table)
    pricing = as.data.table(pricing)
    f <- function(x) {list( Count=length(x))}
    result <- pricing[ , c(f(diffabs),     Sumofdiffabs=sum(diffabs),Variation=mean(diffabs)),by=sellerID]
    result
       sellerID Count Sumofdiffabs Variation
    1:      251     3          220  73.33333
    2:      101     2          100  50.00000
    3:      350     1           10  10.00000
    4:      401     1            0   0.00000