I'm trying to create a summary table to summarise data grouped by 4 variables. Ideally 2 variables will be plotted across the top and 2 across the side. I have attached a table I created with the data in Excel to give an idea of the intended final product.
Here is some code to provide the test data I am working with:
Zone <- c("Europe", "Europe", "Europe", "Europe", "USA", "USA", "Asia",
"Asia", "Europe", "USA", "Europe", "Europe")
Type <- c(1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 2, 2)
Company <- c("Company1", "Company1", "Company4", "Company4", "Company2",
"Company2", "Company2", "Company5", "Company5", "Company5", "Company1",
"Company5")
Duration <- c(3, 3, 3, 1, 3, 2, 5, 5, 5, 4, 4, 4)
Value <- c(1000, 1000, 2000, 2000, 1500, 1300, 6000, 2000, 1000, 1200, 3000,
3000)
df <- data.frame(Zone, Type, Company, Duration, Value)
I need to show a SUM of "value" split by Zone first, then by "Type" (to be plotted at the top of the table. Following this it should be further split by "Company" and then "Duration" (to be plotted on the left of the table). Please see the attached image! It is very important that all factors are shown even if the data on the row/column is all 0's or NA's.
Also, I need to replicate the exact same table but using a COUNT(length) function rather than SUM.
Any ideas would be greatly appreciated!
What makes this a little more complicated is that there are values you want represented in the solution that you don't have in the dataframe, such as all values for "Company3." My solution is to create an "anchor" data frame that contains all combinations of Company
and Duration
then left join a summary table to that. Finally, where values are NA
, set to 0
.
library(dplyr)
# Create anchor dataframe
anchor <- data.frame(Company = rep(c("Company1","Company2","Company3","Company4","Company5"),each=5),
Duration = rep(c(1:5),5),
stringsAsFactors = F)
# Summarize data
summary <- df %>%
group_by(Zone,Type,Company,Duration) %>%
summarise(stat = sum(Value)) %>% # summarise as desired
ungroup() %>%
mutate(Zone.Type = paste0(Zone,".",Type)) %>%
select(-Zone,-Type) %>%
spread(key = Zone.Type, value = stat, fill = 0)
# Join the anchor to the summary
final <- left_join(anchor,summary,by = c("Company","Duration")) %>%
arrange(Company,Duration)
# Set all NA to 0
final[is.na(final)] <- 0
This will give the following result:
Company Duration Asia.1 Asia.2 Europe.1 Europe.2 USA.1 USA.2
1 Company1 1 0 0 0 0 0 0
2 Company1 2 0 0 0 0 0 0
3 Company1 3 0 0 2000 0 0 0
4 Company1 4 0 0 0 3000 0 0
5 Company1 5 0 0 0 0 0 0
6 Company2 1 0 0 0 0 0 0
7 Company2 2 0 0 0 0 0 1300
8 Company2 3 0 0 0 0 1500 0
9 Company2 4 0 0 0 0 0 0
10 Company2 5 6000 0 0 0 0 0
11 Company3 1 0 0 0 0 0 0
12 Company3 2 0 0 0 0 0 0
13 Company3 3 0 0 0 0 0 0
14 Company3 4 0 0 0 0 0 0
15 Company3 5 0 0 0 0 0 0
16 Company4 1 0 0 2000 0 0 0
17 Company4 2 0 0 0 0 0 0
18 Company4 3 0 0 0 2000 0 0
19 Company4 4 0 0 0 0 0 0
20 Company4 5 0 0 0 0 0 0
21 Company5 1 0 0 0 0 0 0
22 Company5 2 0 0 0 0 0 0
23 Company5 3 0 0 0 0 0 0
24 Company5 4 0 0 0 3000 1200 0
25 Company5 5 0 2000 1000 0 0 0
Data frames can't have merged cells like the excel-like output you provided, but I think this is a good approximation. Substitute the summary statistic of your choice where indicated above.