Search code examples
ranalyticssummary

Summing rows by index variable in R


I am working with a database of all transit systems in America and trying to compare different agencies. Each case is a specific part of an organization. For example, bus lines are separate from subways. I would like to combine the values for all of the cases for a given agency.

Basically I want to sum the value of each column for each "Trs_Id" and delete the rest. This data frame is the breakdown of operating expenses ("opex"). Here is what my dataset looks like in R:

> colnames(opex)
[1] "Trs_Id"                  "Mode_Cd"                
[3] "Service_Cd"              "Expense_Category_Desc"  
[5] "Op_Sal_Wage_Amt"         "Other_Sal_Wage_Amt"     
[7] "Fringe_Benefit_Amt"      "Service_Costs_Amt"      
[9] "Fuel_Lubricant_Amt"      "Tire_Tube_Amt"          
[11] "Other_Mat_Sup_Amt"       "Utility_Amt"            
[13] "Casuality_Liability_Amt" "Tax_Amt"                
[15] "In_Report_Amt"           "Sep_Report_Amt"         
[17] "Misc_Expense_Amt"        "Expense_Transfer_Amt"   
[19] "Ada_Related_Amt"        
> NROW(opex)
[1] 6956
> id_nm <- table(opex$Trs_Id)
> NROW(id_nm)
[1] 616

Solution

  • In spite of you didn't make a reproducible example that demonstrate your problem, I think you can solve it using ddply function from plyr package, but you can also use the base function aggregate. I prefer do it with ddply.

    Here's some random numbers for figuring out your data.frame

     #install.packages('plyr')
     library(plyr)
    
     set.seed(007) # for the example being reproducible
     Mode_Cd1 <- replicate(4,sample(LETTERS[1:26], 1, replace=T)) # random genereation of variable Mode_Cd
     Mode_Cd2 <- replicate(4,sample(LETTERS[1:26], 1, replace=T)) 
    
     data_set <- data.frame(Trs_Id = rep(paste('00', 1:4, sep=''), each=3),
                Mode_Cd = sample(paste(Mode_Cd1, Mode_Cd2, sep=''), 12, replace=T),
                Service_Cd =  sample(paste(Mode_Cd2, Mode_Cd1, sep=''), 12, replace=T),
                Op_Sal_Wage_Amt = rnorm(12,5000,100),
                Other_Sal_Wage = rnorm(12,3000,800))
    
     data_set # this is how my random data_set looks like
       Trs_Id Mode_Cd Service_Cd Op_Sal_Wage_Amt Other_Sal_Wage
    1     001      ZG         ID        4910.620       2213.558
    2     001      KU         UK        4969.267       2779.149
    3     001      ZG         ZB        4999.518       2303.319
    4     002      ZG         ZB        5098.816       3574.968
    5     002      BZ         ZB        5083.975       3088.522
    6     002      ZG         GZ        5070.534       2937.227
    7     003      KU         ID        5130.596       2663.608
    8     003      ZG         UK        4861.200       2550.299
    9     003      DI         ZB        5127.292       3798.011
    10    004      ZG         UK        5018.419       2115.896
    11    004      BZ         ID        5075.228       2886.170
    12    004      KU         UK        5059.175       3251.996
    
     ddply(data_set,.(Trs_Id),numcolwise(sum)) # The sum you want.
      Trs_Id Op_Sal_Wage_Amt Other_Sal_Wage
    1    001        14879.40       7296.026
    2    002        15253.33       9600.717
    3    003        15119.09       9011.918
    4    004        15152.82       8254.062