Search code examples
rdplyrreshapecontingency

Summing Contingency Tables in R with first column as character


My sales dataset includes 3 columns: Countries, Sales Type/Method, Total Quarterly Revenue. Here's a display of the first few rows for a better idea:

   Retailer.country Order.method.type    Qtr.Rev
         <fctr>            <fctr>      <dbl>
 1        Australia            E-mail  171407.28
 2        Australia       Sales visit 2013909.18
 3        Australia           Special  158795.34
 4        Australia         Telephone 2289201.87
 5        Australia               Web 1738303.59
 6          Austria       Sales visit   66926.18
 7          Austria         Telephone 1671887.40
 8          Austria               Web 7050164.50
 9          Belgium       Sales visit 1655507.05
10          Belgium               Web 6222440.26
etc.........

Here's the dput of this data:

    structure(list(Retailer.country = structure(c(1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 3L, 3L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L, 
7L, 8L, 8L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 
11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 14L, 
14L, 14L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 17L, 17L, 17L, 
17L, 18L, 18L, 19L, 19L, 19L, 19L, 19L, 20L, 20L, 20L, 20L, 20L, 
21L, 21L, 21L, 21L, 21L, 21L), .Label = c("Australia", "Austria", 
"Belgium", "Brazil", "Canada", "China", "Denmark", "Finland", 
"France", "Germany", "Italy", "Japan", "Korea", "Mexico", 
"Netherlands", 
"Singapore", "Spain", "Sweden", "Switzerland", "United Kingdom", 
"United States"), class = "factor"), Order.method.type = 
structure(c(1L, 
4L, 5L, 6L, 7L, 4L, 6L, 7L, 4L, 7L, 7L, 1L, 2L, 4L, 7L, 2L, 4L, 
6L, 7L, 4L, 7L, 4L, 7L, 2L, 4L, 6L, 7L, 1L, 3L, 4L, 7L, 1L, 2L, 
4L, 5L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 4L, 6L, 7L, 4L, 5L, 7L, 
2L, 3L, 6L, 7L, 2L, 5L, 6L, 7L, 2L, 3L, 6L, 7L, 1L, 7L, 2L, 4L, 
5L, 6L, 7L, 1L, 2L, 4L, 6L, 7L, 2L, 3L, 4L, 5L, 6L, 7L), .Label = 
c("E-mail", 
"Fax", "Mail", "Sales visit", "Special", "Telephone", "Web"), class = 
"factor"), 
    Qtr.Rev = c(171407.28, 2013909.18, 158795.34, 2289201.87, 
    1738303.59, 66926.18, 1671887.4, 7050164.5, 1655507.05, 
    6222440.26, 
    7746789.52, 6864270.12, 195549.5, 450628.79, 12376528.53, 
    415128.31, 1453194.14, 2735416.3, 15777880.11, 413978.16, 
    3776833.13, 308638.6, 12328172.97, 709194.65, 1304167.86, 
    5897377.14, 11048160.97, 1546079.43, 1247170.05, 2373591.15, 
    12102240.99, 2461322.51, 165800.42, 1397604.56, 198705.05, 
    7413833.64, 2662351.94, 289704.5, 680467.87, 87186.72, 343708.86, 
    1802166.73, 16990817.52, 2821127.32, 431860.34, 10144353.75, 
    5063353.42, 1725508.54, 3571760.87, 593828.88, 1074860.66, 
    2981026.86, 5254137.56, 469627.61, 908725.05, 1625096.56, 
    9677070.09, 88788.41, 337710.73, 254360.21, 7835117.44, 
    1292812.39, 
    4818848.86, 217936.39, 792168.42, 790344.28, 109161.04, 
    4565896.64, 
    697619.35, 264500.2, 189218.02, 2022968.96, 13756025.4, 
    1357389.56, 
    2352483.29, 2842600.85, 685752.21, 13437403.28, 29573813.7
    )), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-79L), .Names = c("Retailer.country", "Order.method.type", "Qtr.Rev"
))

I'm creating a contingency table in R that shows the quarterly revenue generated by each sales method for each country. The final output should look similar to this:

Retailer.country     E-mail        Fax      Mail Sales visit   Special  Telephone       Web  TOTAL.cn
1         Australia   171407.3       0.00       0.0  2013909.18  158795.3  2289201.9   1738304   6371617
2           Austria        0.0       0.00       0.0    66926.18       0.0  1671887.4   7050164   8788978
3           Belgium        0.0       0.00       0.0  1655507.05       0.0        0.0   6222440   7877947
4            Brazil        0.0       0.00       0.0        0.00       0.0        0.0   7746790   7746790
5            Canada  6864270.1  195549.50       0.0   450628.79       0.0        0.0  12376529  19886977
6             China        0.0  415128.31       0.0  1453194.14       0.0  2735416.3  15777880  20381619
7           Denmark        0.0       0.00       0.0   413978.16       
...
20   United Kingdom   697619.3  264500.20       0.0   189218.02       0.0  2022969.0  13756025  16930332
21    United States        0.0 1357389.56 2352483.3  2842600.85  685752.2 13437403.3  29573814  50249443
22       TOTAL.type 15695863.0 4767448.43 5692692.6 23233800.42 4811539.3 35257926.7 203769190 293228461

The cast() function from the reshape library gets most of the job done, and only leaves the summary column and row for all values to be computed.

cast(sales.by.country, Retailer.country ~ Order.method.type, 
fill=0) -> sales.by.country

Summing the rows into a new column named "TOTAL.cn" is pretty simple:

sales.by.country$TOTAL.cn <- rowSums(sales.by.country[,c(2:8)])

But summing the columns become a major headache, because the first component of that last row must either be a factor or character. I converted the first column "Retailer.country" to the character type, because it really serves as nothing more than a visual label.

After fussing around with several functions, this is the best code I was able to create to achieve the intended row summation:

# Sum the numeric columns, which is everything *except* column 1
total.by.ordertype <- (colSums(sales.by.country[,-1]))

# Create the Total by Order row
total.by.ordertype.row <- list("TOTAL.type", total.by.ordertype[1], 
total.by.ordertype[2], total.by.ordertype[3], total.by.ordertype[4], 
total.by.ordertype[5], total.by.ordertype[6], total.by.ordertype[7], 
total.by.ordertype[8])

# Add the Total by Order row to the bottom of the table
sales.by.country[22, ] <- total.by.ordertype.row

It works and maintains the proper data types among all columns... BUT I figure there has to be a more efficient way, perhaps by using the apply family of functions, something from dplyr, etc. Maybe the only way is to write my own function?

For example, future datasets may have 50+ different sales methods. When creating the list for the "Total by Order" row (above), I had to call out every single cell within the vector, separated by commas, in order to successfully add this to my existing table. Other efforts converted the data types of all the other columns to characters, which messes everything up.

I don't mind copying/pasting "total.by.ordertype" 8 times so much. But what happens when I'm dealing with 50-100 order types? Is there a cleaner method to reproduce any of this?

Thank you!


Solution

  • The cast() function from the reshape library can do the whole job. With the parameter margin = TRUE, all row and column totals will be computed:

    reshape::cast(sales.by.country, Retailer.country ~ Order.method.type, fun.aggregate = sum, 
         fill = 0, margins = TRUE)
    
       Retailer.country     E-mail        Fax      Mail Sales visit   Special  Telephone       Web     (all)
    1         Australia   171407.3       0.00       0.0  2013909.18  158795.3  2289201.9   1738304   6371617
    2           Austria        0.0       0.00       0.0    66926.18       0.0  1671887.4   7050164   8788978
    3           Belgium        0.0       0.00       0.0  1655507.05       0.0        0.0   6222440   7877947
    4            Brazil        0.0       0.00       0.0        0.00       0.0        0.0   7746790   7746790
    5            Canada  6864270.1  195549.50       0.0   450628.79       0.0        0.0  12376529  19886977
    6             China        0.0  415128.31       0.0  1453194.14       0.0  2735416.3  15777880  20381619
    7           Denmark        0.0       0.00       0.0   413978.16       0.0        0.0   3776833   4190811
    8           Finland        0.0       0.00       0.0   308638.60       0.0        0.0  12328173  12636812
    9            France        0.0  709194.65       0.0  1304167.86       0.0  5897377.1  11048161  18958901
    10          Germany  1546079.4       0.00 1247170.1  2373591.15       0.0        0.0  12102241  17269082
    11            Italy  2461322.5  165800.42       0.0  1397604.56  198705.0        0.0   7413834  11637266
    12            Japan  2662351.9  289704.50  680467.9    87186.72  343708.9  1802166.7  16990818  22856404
    13            Korea        0.0       0.00       0.0  2821127.32       0.0   431860.3  10144354  13397341
    14           Mexico        0.0       0.00       0.0  5063353.42 1725508.5        0.0   3571761  10360623
    15      Netherlands        0.0  593828.88 1074860.7        0.00       0.0  2981026.9   5254138   9903854
    16        Singapore        0.0  469627.61       0.0        0.00  908725.1  1625096.6   9677070  12680519
    17            Spain        0.0   88788.41  337710.7        0.00       0.0   254360.2   7835117   8515977
    18           Sweden  1292812.4       0.00       0.0        0.00       0.0        0.0   4818849   6111661
    19      Switzerland        0.0  217936.39       0.0   792168.42  790344.3   109161.0   4565897   6475507
    20   United Kingdom   697619.3  264500.20       0.0   189218.02       0.0  2022969.0  13756025  16930332
    21    United States        0.0 1357389.56 2352483.3  2842600.85  685752.2 13437403.3  29573814  50249443
    22            (all) 15695863.0 4767448.43 5692692.6 23233800.42 4811539.3 35257926.7 203769190 293228461
    

    Of course, fun.aggregate has to be specified as well.


    The same functionality is also available from the reshape2 package, the successor of reshape, but about 4 times faster for this small sample size.

    reshape2::dcast(sales.by.country, Retailer.country ~ Order.method.type, fun.aggregate = sum, 
                    fill = 0, margins = TRUE)
    

    dcast() is also available from the data.table package which claims to be faster than reshape2::dcast(). Unfortunately, the margins parameter hasn't been implemented yet (current CRAN version 1.10.4). Therefore, the margins have to be computed separately and combined with the original data:

    DT2 <- rbind(
      DT,
      DT[, .(Qtr.Rev = sum(Qtr.Rev)), by = Retailer.country],
      DT[, .(Qtr.Rev = sum(Qtr.Rev)), by = Order.method.type],
      DT[, .(Qtr.Rev = sum(Qtr.Rev))], 
      fill = TRUE
    )
    dcast(DT2, Retailer.country ~ Order.method.type, fill = 0)
    
        Retailer.country     E-mail        Fax      Mail Sales visit   Special  Telephone       Web        NA
     1:        Australia   171407.3       0.00       0.0  2013909.18  158795.3  2289201.9   1738304   6371617
     2:          Austria        0.0       0.00       0.0    66926.18       0.0  1671887.4   7050164   8788978
     3:          Belgium        0.0       0.00       0.0  1655507.05       0.0        0.0   6222440   7877947
     4:           Brazil        0.0       0.00       0.0        0.00       0.0        0.0   7746790   7746790
     5:           Canada  6864270.1  195549.50       0.0   450628.79       0.0        0.0  12376529  19886977
     6:            China        0.0  415128.31       0.0  1453194.14       0.0  2735416.3  15777880  20381619
     7:          Denmark        0.0       0.00       0.0   413978.16       0.0        0.0   3776833   4190811
     8:          Finland        0.0       0.00       0.0   308638.60       0.0        0.0  12328173  12636812
     9:           France        0.0  709194.65       0.0  1304167.86       0.0  5897377.1  11048161  18958901
    10:          Germany  1546079.4       0.00 1247170.1  2373591.15       0.0        0.0  12102241  17269082
    11:            Italy  2461322.5  165800.42       0.0  1397604.56  198705.0        0.0   7413834  11637266
    12:            Japan  2662351.9  289704.50  680467.9    87186.72  343708.9  1802166.7  16990818  22856404
    13:            Korea        0.0       0.00       0.0  2821127.32       0.0   431860.3  10144354  13397341
    14:           Mexico        0.0       0.00       0.0  5063353.42 1725508.5        0.0   3571761  10360623
    15:      Netherlands        0.0  593828.88 1074860.7        0.00       0.0  2981026.9   5254138   9903854
    16:        Singapore        0.0  469627.61       0.0        0.00  908725.1  1625096.6   9677070  12680519
    17:            Spain        0.0   88788.41  337710.7        0.00       0.0   254360.2   7835117   8515977
    18:           Sweden  1292812.4       0.00       0.0        0.00       0.0        0.0   4818849   6111661
    19:      Switzerland        0.0  217936.39       0.0   792168.42  790344.3   109161.0   4565897   6475507
    20:   United Kingdom   697619.3  264500.20       0.0   189218.02       0.0  2022969.0  13756025  16930332
    21:    United States        0.0 1357389.56 2352483.3  2842600.85  685752.2 13437403.3  29573814  50249443
    22:               NA 15695863.0 4767448.43 5692692.6 23233800.42 4811539.3 35257926.7 203769190 293228461
        Retailer.country     E-mail        Fax      Mail Sales visit   Special  Telephone       Web        NA