Search code examples
raggregatetabularsummary

Creating Summary Tables with multiple variables in R


I'm trying to create a summary table to summarise data grouped by 4 variables. Ideally 2 variables will be plotted across the top and 2 across the side. I have attached a table I created with the data in Excel to give an idea of the intended final product.

Here is some code to provide the test data I am working with:

Zone <- c("Europe", "Europe", "Europe", "Europe", "USA", "USA", "Asia", 
"Asia", "Europe", "USA", "Europe", "Europe")
Type <- c(1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 2, 2)
Company <- c("Company1", "Company1", "Company4", "Company4", "Company2", 
"Company2", "Company2", "Company5", "Company5", "Company5", "Company1", 
"Company5")
Duration <- c(3, 3, 3, 1, 3, 2, 5, 5, 5, 4, 4, 4)
Value <- c(1000, 1000, 2000, 2000, 1500, 1300, 6000, 2000, 1000, 1200, 3000, 
3000)

df <- data.frame(Zone, Type, Company, Duration, Value)

I need to show a SUM of "value" split by Zone first, then by "Type" (to be plotted at the top of the table. Following this it should be further split by "Company" and then "Duration" (to be plotted on the left of the table). Please see the attached image! It is very important that all factors are shown even if the data on the row/column is all 0's or NA's.Example Image

Also, I need to replicate the exact same table but using a COUNT(length) function rather than SUM.

Any ideas would be greatly appreciated!


Solution

  • What makes this a little more complicated is that there are values you want represented in the solution that you don't have in the dataframe, such as all values for "Company3." My solution is to create an "anchor" data frame that contains all combinations of Company and Duration then left join a summary table to that. Finally, where values are NA, set to 0.

    library(dplyr)
    
    # Create anchor dataframe
    anchor <- data.frame(Company = rep(c("Company1","Company2","Company3","Company4","Company5"),each=5),
                     Duration = rep(c(1:5),5),
                     stringsAsFactors = F)
    # Summarize data
    summary <- df %>%
                 group_by(Zone,Type,Company,Duration) %>%
                 summarise(stat = sum(Value)) %>% # summarise as desired
                 ungroup() %>%
                 mutate(Zone.Type =  paste0(Zone,".",Type)) %>%
                 select(-Zone,-Type) %>%
                 spread(key = Zone.Type, value = stat, fill = 0)
    
    # Join the anchor to the summary
    final <- left_join(anchor,summary,by = c("Company","Duration")) %>%
               arrange(Company,Duration)
    
    # Set all NA to 0
    final[is.na(final)] <- 0
    

    This will give the following result:

        Company Duration Asia.1 Asia.2 Europe.1 Europe.2 USA.1 USA.2
    1  Company1        1      0      0        0        0     0     0
    2  Company1        2      0      0        0        0     0     0
    3  Company1        3      0      0     2000        0     0     0
    4  Company1        4      0      0        0     3000     0     0
    5  Company1        5      0      0        0        0     0     0
    6  Company2        1      0      0        0        0     0     0
    7  Company2        2      0      0        0        0     0  1300
    8  Company2        3      0      0        0        0  1500     0
    9  Company2        4      0      0        0        0     0     0
    10 Company2        5   6000      0        0        0     0     0
    11 Company3        1      0      0        0        0     0     0
    12 Company3        2      0      0        0        0     0     0
    13 Company3        3      0      0        0        0     0     0
    14 Company3        4      0      0        0        0     0     0
    15 Company3        5      0      0        0        0     0     0
    16 Company4        1      0      0     2000        0     0     0
    17 Company4        2      0      0        0        0     0     0
    18 Company4        3      0      0        0     2000     0     0
    19 Company4        4      0      0        0        0     0     0
    20 Company4        5      0      0        0        0     0     0
    21 Company5        1      0      0        0        0     0     0
    22 Company5        2      0      0        0        0     0     0
    23 Company5        3      0      0        0        0     0     0
    24 Company5        4      0      0        0     3000  1200     0
    25 Company5        5      0   2000     1000        0     0     0
    

    Data frames can't have merged cells like the excel-like output you provided, but I think this is a good approximation. Substitute the summary statistic of your choice where indicated above.