r dataframe matrix mean weighted

Weighted mean on multiple columns for all rows


I want to calculate a weighted mean of a huge dataset.

For each row, I need the quantity below, computed over distance1 to distance10. The data contain NAs, so I need something equivalent to na.rm = TRUE:

(distance1 * X1CityNumber + ... + distance10 * X10CityNumber) /
(X1CityNumber + ... + X10CityNumber)

I wrote the following code, but it is producing wrong numbers.

for (i in 1:378742) {
  rcffull$distance[i] <- weighted.mean(cbind(rcffull$distance1[i],
                                             rcffull$distance2[i],
                                             rcffull$distance3[i],
                                             rcffull$distance4[i],
                                             rcffull$distance5[i],
                                             rcffull$distance6[i],
                                             rcffull$distance7[i],
                                             rcffull$distance8[i],
                                             rcffull$distance9[i],
                                             rcffull$distance10[i]),
                                       cbind(rcffull$X1CityNumber[i],
                                             rcffull$X2CityNumber[i],
                                             rcffull$X3CityNumber[i],
                                             rcffull$X4CityNumber[i],
                                             rcffull$X5CityNumber[i],
                                             rcffull$X6CityNumber[i],
                                             rcffull$X7CityNumber[i],
                                             rcffull$X8CityNumber[i],
                                             rcffull$X9CityNumber[i],
                                             rcffull$X10CityNumber[i]),
                                       na.rm = TRUE)
  }

Any suggestions?


Sample data with fewer columns:

  distance1  Weights1  distance2  Weights2
1         5         3          8         2
2        NA         2          3         3
3         5        NA          4         4

#desired output:
    Mean distance
1      6.2 #= (5 * 3 + 8 * 2) / (3 + 2)
2      3.0 #= (3 * 3) / 3
3      4.0 #= (4 * 4) / 4

Solution

  • NAs occur in both the weights and the distances. When computing (d1 * w1 + d2 * w2) / (w1 + w2), any pair containing an NA must be dropped from both the numerator and the denominator, so that the weights are normalized over the remaining pairs only.

    dat <- structure(list(distance1 = c(5L, NA, 5L), Weights1 = c(3L, 2L, NA),
    distance2 = c(8L, 3L, 4L), Weights2 = c(2L, 3L, 4L)), .Names = c("distance1", 
    "Weights1", "distance2", "Weights2"), class = "data.frame", row.names = c("1", 
    "2", "3"))
    
    A <- as.matrix(dat[c(1, 3)])  ## distance columns
    B <- as.matrix(dat[c(2, 4)])  ## weight columns
    B[is.na(A)] <- 0  ## a missing distance must contribute no weight
    rowSums(A * B, na.rm = TRUE) / rowSums(B, na.rm = TRUE)
    #  1   2   3 
    #6.2 3.0 4.0 
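
As a rough sketch of how the same idea might be applied to the column names from the question (distance1, ..., distance10 and X1CityNumber, ..., X10CityNumber), the code below wraps the calculation in a helper. The function name row_weighted_mean and the toy stand-in for rcffull (only two column pairs) are assumptions made for illustration; they are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): row-wise weighted mean
# that ignores any (value, weight) pair in which either entry is NA.
row_weighted_mean <- function(df, value_cols, weight_cols) {
  stopifnot(length(value_cols) == length(weight_cols))
  A <- as.matrix(df[value_cols])   # values, one column per pair
  B <- as.matrix(df[weight_cols])  # weights, in the same column order as A
  B[is.na(A)] <- 0                 # a missing value contributes no weight
  unname(rowSums(A * B, na.rm = TRUE) / rowSums(B, na.rm = TRUE))
}

# Toy stand-in for rcffull with only two of the ten column pairs.
rcffull <- data.frame(
  distance1 = c(5, NA, 5), X1CityNumber = c(3, 2, NA),
  distance2 = c(8, 3, 4),  X2CityNumber = c(2, 3, 4)
)

k <- 2  # would be 10 for the full data described in the question
rcffull$distance <- row_weighted_mean(
  rcffull,
  paste0("distance", seq_len(k)),
  paste0("X", seq_len(k), "CityNumber")
)
rcffull$distance
# [1] 6.2 3.0 4.0
```

Building the column names with paste0 keeps the value and weight columns in matching order, which the element-wise product A * B relies on.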
    

    Remark 1:

    If there are no NAs in either the data or the weights, just do

    rowSums(A * B) / rowSums(B)
    

    Remark 2:

    An alternative way to handle NAs: wherever either the datum or its weight is NA, set both to 0, then use rowSums without na.rm:

    ind <- is.na(A) | is.na(B)
    A[ind] <- 0
    B[ind] <- 0
    rowSums(A * B) / rowSums(B)
    

    Remark 3:

    NaN can result from 0 / 0 when a row has no pair in which both the datum and the weight are non-NA.
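
A small illustration of the 0 / 0 case, using a made-up two-row matrix (the numbers are assumptions chosen so that the second row has no usable pair). Converting NaN to NA at the end is one possible convention, not something the original answer prescribes.

```r
# Row 1 has one complete pair (5, 3); row 2 has none.
A <- matrix(c( 5, NA,
              NA,  7), nrow = 2, byrow = TRUE)  # values
B <- matrix(c( 3,  2,
               4, NA), nrow = 2, byrow = TRUE)  # weights

B[is.na(A)] <- 0
res_raw <- rowSums(A * B, na.rm = TRUE) / rowSums(B, na.rm = TRUE)
res_raw  # row 1 is 5, row 2 is NaN (0 / 0)

# Optional: report rows without any usable pair as missing.
res <- res_raw
res[is.nan(res)] <- NA
```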

    Remark 4:

    weighted.mean with na.rm = TRUE only removes NAs in the data, not in the weights. It is also a poor fit here because you want the calculation for every row: it is not vectorized over rows, so you would have to run a slowish R-level loop.
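
To make this concrete, the sketch below first shows weighted.mean returning NA when a weight is missing, then a row-wise loop that masks incomplete pairs by hand. The loop is only meant for comparison on the small sample data; the variable names (ok, loop_res) are illustrative.

```r
x <- c(5, 4)
w <- c(NA, 4)
na_w <- weighted.mean(x, w, na.rm = TRUE)
na_w  # NA: na.rm drops missing values in x, not missing weights

# Dropping incomplete pairs by hand gives the intended result.
ok <- !is.na(x) & !is.na(w)
weighted.mean(x[ok], w[ok])  # 4

# The same masking applied row by row (an R-level loop via vapply).
dat <- data.frame(distance1 = c(5, NA, 5), Weights1 = c(3, 2, NA),
                  distance2 = c(8, 3, 4),  Weights2 = c(2, 3, 4))
A <- as.matrix(dat[c("distance1", "distance2")])
B <- as.matrix(dat[c("Weights1", "Weights2")])

loop_res <- vapply(seq_len(nrow(A)), function(i) {
  ok <- !is.na(A[i, ]) & !is.na(B[i, ])
  weighted.mean(A[i, ok], B[i, ok])
}, numeric(1))
loop_res
# [1] 6.2 3.0 4.0
```

On a few rows the loop is fine; on several hundred thousand rows the rowSums version avoids one function call per row.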