I've got a tibble of data in long format where some observations are missing --- think sales of different articles on different dates, where an article might not have been sold on a given day at all.
I want to complete()
this data so that all combinations of article and date are present in the tibble (this is for creating area plots, since ggplot
's geom_area()
apparently averages neighboring values for the missing data otherwise).
The original tibble contains both an ID number and a textual label for each article, and I would like to also add the same textual labels to the complete()
d tibble (this is so that I'll then be able to glue()
together a more informative label for a subsequent plot); I'm unsure how to do this in a tidy manner, however.
The mapping of ID numbers to labels is available in another tibble.
Here is an artificial example:
library(tidyverse)
articleData = tibble(article = 1:3, description = c("Foo", "Bar", "Baz"), price = c(1, 2, 2))
article <- c(1:3, 1, 3, 1:3)
description <- map_chr(article, ~ articleData[[., "description"]])
date <- as.Date(c(rep("2023-01-01", times = 3), rep("2023-01-02", times = 2), rep("2023-01-03", times = 3)))
sales <- c(21, 58, 35, 32, 47, 25, 47, 42)
price <- map_int(article, ~ articleData[[., "price"]])
data <- tibble(
Date = date,
Article = as_factor(article),
ArticleName = description,
Sales = sales,
Price = price,
Turnover = sales * price
)
data
data %>%
group_by(Date) %>%
summarize(
Date = unique(Date),
TurnoverTotal = sum(Turnover),
SalesTotal = sum(Sales)
) %>%
full_join(data) %>%
mutate(
TurnoverShare = Turnover / TurnoverTotal,
SalesShare = Sales / SalesTotal
) ->
data
data
# This does not work.
data %>% ggplot(mapping = aes(x = Date, y = TurnoverShare, fill = Article)) +
geom_area()
# Explicitly set missing shares to zero.
data %>% complete(
Date,
Article,
fill = list(
SalesShare = 0,
TurnoverShare = 0
)
) -> data2
data2
# This works.
data2 %>% ggplot(mapping = aes(x = Date, y = TurnoverShare, fill = Article)) +
geom_area()
In this example, article 2 (the "Bar") was not sold no 2023-01-02, so ArticleName
is NA
in the relevant row in data2
. Is there a way to directly set ArticleName
based on Article
within the call to complete()
, using the articleData
tibble? Alternatively, is there a good (tidy) way of adding this afterward?
Thank you.
You can achieve this using complete()
provided that the names and types match in both data frames (complete()
is powered in part by a join afterall).
library(dplyr)
library(tidyr)
# Name and types must match in x and y data frames
articleData <- tibble(article = 1:3, description = c("Foo", "Bar", "Baz"), price = c(1, 2, 2)) %>%
rename(Article = article, ArticleName = description, Price = price) %>%
mutate(Article = factor(Article))
data %>%
complete(
Date,
articleData,
fill = list(
SalesShare = 0,
TurnoverShare = 0
)
)
# A tibble: 9 × 10
Date Article ArticleName Price TurnoverTotal SalesTotal Sales Turnover TurnoverShare SalesShare
<date> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2023-01-01 1 Foo 1 207 114 21 21 0.101 0.184
2 2023-01-01 2 Bar 2 207 114 58 116 0.560 0.509
3 2023-01-01 3 Baz 2 207 114 35 70 0.338 0.307
4 2023-01-02 1 Foo 1 126 79 32 32 0.254 0.405
5 2023-01-02 2 Bar 2 NA NA NA NA 0 0
6 2023-01-02 3 Baz 2 126 79 47 94 0.746 0.595
7 2023-01-03 1 Foo 1 203 114 25 25 0.123 0.219
8 2023-01-03 2 Bar 2 203 114 47 94 0.463 0.412
9 2023-01-03 3 Baz 2 203 114 42 84 0.414 0.368