I am am trying to use it to model the pricing accuracy of a group of sellers in a network.
My data set (pricing) looks like this:
transactionID sellerID expectedprice actualprice pricediff
1 1001 251 200 210 10
2 1002 101 200 300 100
3 1003 251 400 190 -210
4 1004 251 300 300 0
5 1005 101 250 250 0
6 1006 350 200 210 10
7 1007 401 400 400 0
Note: I am not trying to do a standard deviation calculation since I am not trying to calculate variance from the mean, but rather variance from the expected value column which will differ depending on the transaction.
I am comfortable inserting new columns to get absolute variances from the expected value into the table using:
pricing$diffabs <- abs(pricing$pricediff)
which results in the following:
transactionID sellerID expectedprice actualprice pricediff diffabs
1001 251 200 210 10 10
1002 101 200 300 100 100
1003 251 400 190 -210 210
1004 251 300 300 0 0
1005 101 250 250 0 0
1006 350 200 210 10 10
1007 401 400 400 0 0
How does one then calculate a variance score for each seller which would be:
the sum of abs(pricing$diff)
grouped at the "sellerID" divided by the number of observations (count) of "sellerID" in the data.
The output I would expect to be would be the following:
SellerID Count Sumofdiffabs Variation
251 3 220 73.33333333
101 2 100 50
350 1 10 10
401 1 0 0
The other help topics that deal with variances in R at an aggregated level seem to only deal with standard deviation or variances from mean, such as this:
Calculating grouped variance from a frequency table in R
The aggregate function works well for me when using a simple function like standard deviation, but not where I have to figure out how to insert a count into function. What is throwing me off, is that my variance is a deviation not from the mean, but from a column result in my table.
m = matrix(c(1001,251,200,210,10,1002,101,200,300,100,1003,251,400,190,-210,1004,251,300,300,0,1005,101,250,250,0,1006,350,200,210,10,1007,401,400,400,0),ncol = 5,nrow=7,byrow=TRUE)
colnames(m) = c("transactionID","sellerID","expectedprice","actualprice","pricediff")
pricing = as.data.frame(m)
pricing$diffabs <- abs(pricing$pricediff)
pricing
transactionID sellerID expectedprice actualprice pricediff diffabs
1001 251 200 210 10 10
1002 101 200 300 100 100
1003 251 400 190 -210 210
1004 251 300 300 0 0
1005 101 250 250 0 0
1006 350 200 210 10 10
1007 401 400 400 0 0
So here is the result:
library(data.table)
pricing = as.data.table(pricing)
f <- function(x) {list( Count=length(x))}
result <- pricing[ , c(f(diffabs), Sumofdiffabs=sum(diffabs),Variation=mean(diffabs)),by=sellerID]
result
sellerID Count Sumofdiffabs Variation
1: 251 3 220 73.33333
2: 101 2 100 50.00000
3: 350 1 10 10.00000
4: 401 1 0 0.00000