
Calculate the influence of one observation on the overall result in R


I have an overview table: a list of item counts, the actual cost, and the predicted cost.

library(data.table)

myData <- data.table("itemCount" = c(3000, 20, 50, 9),
                     "cost" = c(120, 118, 165, 93),
                     "prediction" = c(120, 100, 150, 120))

Then I calculate the individual and overall profit:

myData[, "profit" := cost/prediction]

total <- myData[, .(itemsTotal = sum(itemCount),
                costTotal  = sum(cost), 
                predictionTotal = sum(prediction))][
                  , "profit" := costTotal/predictionTotal 
                ]

Now, for every row, I want to calculate what the overall profit would have been if that particular row had been excluded from the analysis. For example, if row two were missing:

myDataEx <- myData[-2, ]
totalEx <- myDataEx[, .(itemsTotal = sum(itemCount),
                        costTotal  = sum(cost),
                        predictionTotal = sum(prediction))][
                          , "profit" := costTotal/predictionTotal
                        ]

So I wrote a for loop to do this:

myData$diffinProfit <- NA
for(observation in seq_along(length(myData)-1)){
  
  myDataEx <- myData[- observation, ]
  totalEx <- myDataEx[, .(itemsTotal = sum(itemCount),
                          costTotal  = sum(cost), 
                          predictionTotal = sum(prediction))][
                            , "profit" := costTotal/predictionTotal 
                            ]
  
  myData$diffinProfit[[observation]] <- totalEx$profit
  
}

However, I only get a result for the first observation. How can I fix the for loop? Is there any way I could use an apply function? I was considering mapply, or maybe a purrr function.


Solution

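  • The first problem is that length(myData) reports the number of columns, not the number of rows. On top of that, seq_along() applied to that single number always returns just 1, so the loop body runs only once; seq_len(nrow(myData)) would visit every row. But I think we can do without the for loop (though sapply is itself a loop under the hood).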

    myData[, otherProfit := sapply(seq_len(.N), function(z) sum(cost[-z])/sum(prediction[-z]))]
    myData
    #    itemCount  cost prediction profit otherProfit
    #        <num> <num>      <num>  <num>       <num>
    # 1:      3000   120        120  1.000   1.0162162
    # 2:        20   118        100  1.180   0.9692308
    # 3:        50   165        150  1.100   0.9735294
    # 4:         9    93        120  0.775   1.0891892
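
    If you would rather keep the loop, here is a minimal sketch of one way to repair it, assuming the same myData as in the question. It iterates over seq_len(nrow(myData)) and assigns with data.table's set(). The column name loopProfit and the helper variable rest are illustrative choices, not from the original post.

    ```r
    library(data.table)

    myData <- data.table(itemCount  = c(3000, 20, 50, 9),
                         cost       = c(120, 118, 165, 93),
                         prediction = c(120, 100, 150, 120))

    # seq_along(length(myData) - 1) is seq_along() of a single number, i.e. just 1;
    # seq_len(nrow(myData)) visits every row instead.
    myData[, loopProfit := NA_real_]   # hypothetical column name
    for (i in seq_len(nrow(myData))) {
      rest <- myData[-i]                                   # all rows except row i
      set(myData, i = i, j = "loopProfit",
          value = sum(rest$cost) / sum(rest$prediction))   # overall ratio without row i
    }
    ```

    This should give the same values as the otherProfit column above; it is only meant to show where the original loop went wrong.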
    

    Mathematically, though, it's possible to do it without a loop at all: subtract each row's cost and prediction from the column totals.

    sumcost <- sum(myData$cost)
    sumpred <- sum(myData$prediction)
    myData[, profit2 := (sumcost-cost)/(sumpred-prediction)]
    myData
    #    itemCount  cost prediction profit otherProfit   profit2
    #        <num> <num>      <num>  <num>       <num>     <num>
    # 1:      3000   120        120  1.000   1.0162162 1.0162162
    # 2:        20   118        100  1.180   0.9692308 0.9692308
    # 3:        50   165        150  1.100   0.9735294 0.9735294
    # 4:         9    93        120  0.775   1.0891892 1.0891892
    

    I'm not going to benchmark 4 rows, but I'd be surprised if this second "vectorized" approach isn't more efficient than the sapply above or a for-loop alternative.
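
    For a rough check on larger input, a sketch along these lines could be used. The data is synthetic (random values, 5000 rows, an arbitrary seed), which is purely an assumption for illustration, and the names big, t_sapply and t_vec are made up here. Timings depend on the machine, so none are shown.

    ```r
    library(data.table)

    set.seed(1)   # arbitrary seed, only for reproducibility
    n <- 5000     # arbitrary size, chosen just for illustration
    big <- data.table(cost       = runif(n, 50, 200),
                      prediction = runif(n, 50, 200))

    # sapply version: recomputes both sums for every row
    t_sapply <- system.time(
      big[, otherProfit := sapply(seq_len(.N), function(z) sum(cost[-z]) / sum(prediction[-z]))]
    )

    # vectorized version: subtract each row from the precomputed totals
    t_vec <- system.time(
      big[, profit2 := (sum(cost) - cost) / (sum(prediction) - prediction)]
    )

    # the two columns should agree up to floating-point error
    all.equal(big$otherProfit, big$profit2)

    # elapsed seconds for each approach
    c(sapply = t_sapply[["elapsed"]], vectorized = t_vec[["elapsed"]])
    ```

    The sapply version does work proportional to the number of rows for each row, while the vectorized version computes the two totals once, so the gap would be expected to widen as the table grows.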