I am trying to interpolate missing NA values within a data frame that is in "long-format," but am having trouble. Here is an example:
library(tidyverse)
library(zoo)
#step 1 - write data frame
df <- tibble(day=1:10,
walmart = c(1, 2, 3, 4, NA, NA, 7, 8 , 9, 10),
amazon= c(10, 15, 20, NA, 30, 35, 40, NA, 50, 55))
#step 2 - convert to long format
df <- df %>%
pivot_longer(cols= c(walmart, amazon),
names_to = "company",
values_to = "sales")
#step 3 - perform interpolation
df <- df %>%
mutate(sales=na.approx(sales))
Evidently, the interpolation is not taking into account the two different stores. I would like to interpolate the missing values for walmart based on the surrounding values for walmart only, and the missing values for amazon based on the surrounding values for amazon only. I do recognize that performing the interpolation with the data in wide format (as it is after step 1) on each column (in this case walmart and amazon) is possible, but I am trying to avoid doing this with my actual data because a) there are thousands of "companies" and b) I want to interpolate several variables so by transposing the data I would just put myself back in the same position. Moreover, I don't thinking using group_by(company) will work because that yields a size error with na.approx. Is there a way to resolve this? Thank you very much.
Group by company
(not sales
):
df %>%
group_by(company) %>%
mutate(sales=zoo::na.approx(sales)) %>%
ungroup()
# # A tibble: 20 x 3
# day company sales
# <int> <chr> <dbl>
# 1 1 walmart 1
# 2 1 amazon 10
# 3 2 walmart 2
# 4 2 amazon 15
# 5 3 walmart 3
# 6 3 amazon 20
# 7 4 walmart 4
# 8 4 amazon 25
# 9 5 walmart 5
# 10 5 amazon 30
# 11 6 walmart 6
# 12 6 amazon 35
# 13 7 walmart 7
# 14 7 amazon 40
# 15 8 walmart 8
# 16 8 amazon 45
# 17 9 walmart 9
# 18 9 amazon 50
# 19 10 walmart 10
# 20 10 amazon 55
Since you're running into errors in your use, that suggests that you have NA
s on either end of a company's set of values. For instance,
df$amazon[10] <- NA
df
# # A tibble: 10 x 3
# day walmart amazon
# <int> <dbl> <dbl>
# 1 1 1 10
# 2 2 2 15
# 3 3 3 20
# 4 4 4 NA
# 5 5 NA 30
# 6 6 NA 35
# 7 7 7 40
# 8 8 8 NA
# 9 9 9 50
# 10 10 10 NA
df <- df %>%
pivot_longer(cols= c(walmart, amazon),
names_to = "company",
values_to = "sales")
df %>%
group_by(company) %>%
mutate(sales=zoo::na.approx(sales)) %>%
ungroup()
Error in `mutate()`:
# ! Problem while computing `sales = zoo::na.approx(sales)`.
# x `sales` must be size 10 or 1, not 9.
# i The error occurred in group 1: company = "amazon".
# Run `rlang::last_error()` to see where the error occurred.
The issue with this is that zoo::na.approx
is shortening your vector. Consider this:
zoo::na.approx(c(1,NA,3))
# [1] 1 2 3
zoo::na.approx(c(1,NA,3,NA))
# [1] 1 2 3
zoo::na.approx(c(1,NA,3,NA), na.rm=FALSE)
# [1] 1 2 3 NA
So add na.rm=FALSE
to your call to na.approx
:
df %>%
group_by(company) %>%
mutate(sales=zoo::na.approx(sales, na.rm = FALSE)) %>%
ungroup()
# A tibble: 20 x 3
# day company sales
# <int> <chr> <dbl>
# 1 1 walmart 1
# 2 1 amazon 10
# 3 2 walmart 2
# 4 2 amazon 15
# 5 3 walmart 3
# 6 3 amazon 20
# 7 4 walmart 4
# 8 4 amazon 25
# 9 5 walmart 5
# 10 5 amazon 30
# 11 6 walmart 6
# 12 6 amazon 35
# 13 7 walmart 7
# 14 7 amazon 40
# 15 8 walmart 8
# 16 8 amazon 45
# 17 9 walmart 9
# 18 9 amazon 50
# 19 10 walmart 10
# 20 10 amazon NA
Note that since this is doing interpolation, the NA
values outside of the known range will not be replaced. If you need that, consider what form of extrapolation you need.