I am trying to extrapolate forwards for missing values after the last non-missing value in a series using the growth rate given in another column.
This is a toy example below. My data is grouped according to product and region and should be arranged by year within grouping. I have sales data with some NA values. I am interested in using the World.growth rate to extrapolate the NA values for sales which appear after the last non-missing value in each series. So say for example Product = 1, Region = 1, I want the Interpolated.sales for 1973 to be 1972 value * World.growth value given in 1973.
Here is my data:
> library(openxlsx)
> library(dplyr)
>
> sales.data <- read.xlsx("\\Interpolate_sales.xlsx", sheet = 1)
>
> print(sales.data)
Product Region Year Sales World.growth Extrapolated.sales
1 1 1 1970 NA 0.88 NA
2 1 1 1971 141.906913 0.72 141.906913
3 1 1 1972 9.605398 1.01 9.605398
4 1 1 1973 NA 1.18 11.334370
5 1 1 1974 NA 0.84 9.520871
6 1 2 1970 60.062486 0.88 60.062486
7 1 2 1971 124.904150 0.72 124.904150
8 1 2 1972 NA 1.01 126.153191
9 1 2 1973 NA 1.18 148.860765
10 1 2 1974 NA 0.84 125.043043
11 1 2 1975 NA 1.23 153.802943
12 1 3 1970 63.298780 0.88 63.298780
13 1 3 1971 90.219126 0.72 90.219126
14 1 3 1972 107.271043 1.01 107.271043
15 1 3 1973 129.122561 1.18 129.122561
16 1 3 1974 NA 0.84 108.462951
This is my dplyr attempt (Is there perhaps a better way by writing a function or something? I guess this will also be handy if I want to backwards extrapolate missing values at the start of a series.)
> sales.data.extrapolated <- sales.data %>%
+ group_by(Product, Region) %>%
+ arrange(Product, Region, Year) %>%
+ mutate(Extrapolated.sales.manual = case_when(
+ row_number() > max(which(!is.na(Sales))) ~ lag(Sales) * World.growth,
+ TRUE ~ Sales
+ ))
>
> print(sales.data.extrapolated)
# A tibble: 16 × 7
# Groups: Product, Region [3]
Product Region Year Sales World.growth Extrapolated.sales Extrapolated.sales.manual
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1970 NA 0.88 NA NA
2 1 1 1971 142. 0.72 142. 142.
3 1 1 1972 9.61 1.01 9.61 9.61
4 1 1 1973 NA 1.18 11.3 11.3
5 1 1 1974 NA 0.84 9.52 NA
6 1 2 1970 60.1 0.88 60.1 60.1
7 1 2 1971 125. 0.72 125. 125.
8 1 2 1972 NA 1.01 126. 126.
9 1 2 1973 NA 1.18 149. NA
10 1 2 1974 NA 0.84 125. NA
11 1 2 1975 NA 1.23 154. NA
12 1 3 1970 63.3 0.88 63.3 63.3
13 1 3 1971 90.2 0.72 90.2 90.2
14 1 3 1972 107. 1.01 107. 107.
15 1 3 1973 129. 1.18 129. 129.
16 1 3 1974 NA 0.84 108. 108.
As you can see, this only calculates the value for the first missing value and none beyond.
This is a "reduction" problem and must be done step-wise, since "this row's" new value depends on the results of "last row's" value ... the normal vectorized operations in R don't know the updated value for the previous value until after the whole vector has been calculated.
For this, I'll use Reduce
, though purrr::accumulate
is nearly a drop-in replacement if you're already using purrr
.
sales.data %>%
group_by(Product, Region) %>%
mutate(
newcol = Reduce(function(prevval, ind) coalesce(Sales[ind], prevval * World.growth[ind]),
row_number(), init = NA, accumulate = TRUE)[-1]
) %>%
ungroup()
# # A tibble: 16 × 7
# Product Region Year Sales World.growth Extrapolated.sales newcol
# <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1970 NA 0.88 NA NA
# 2 1 1 1971 142. 0.72 142. 142.
# 3 1 1 1972 9.61 1.01 9.61 9.61
# 4 1 1 1973 NA 1.18 11.3 11.3
# 5 1 1 1974 NA 0.84 9.52 9.52
# 6 1 2 1970 60.1 0.88 60.1 60.1
# 7 1 2 1971 125. 0.72 125. 125.
# 8 1 2 1972 NA 1.01 126. 126.
# 9 1 2 1973 NA 1.18 149. 149.
# 10 1 2 1974 NA 0.84 125. 125.
# 11 1 2 1975 NA 1.23 154. 154.
# 12 1 3 1970 63.3 0.88 63.3 63.3
# 13 1 3 1971 90.2 0.72 90.2 90.2
# 14 1 3 1972 107. 1.01 107. 107.
# 15 1 3 1973 129. 1.18 129. 129.
# 16 1 3 1974 NA 0.84 108. 108.
Walkthrough:
Reduce
is called either nrows-1
times (if init=
is not provided) or nrows
if it is provided;prevval
is set to the init=
value we provided (NA
) and ind
is 1
(the first from row_number()
);
init=
completely, then on the first iteration, prevval
would be 1
and ind
would be 2
, the first two values in row_number()
; not what we wantprevval
is the previous row's Sales
, and ind
is the row number ... they are different "types" of numbers, not to be intermingled, which is why having prevval=1, ind=2
was a problemcoalesce(..)
calculates the expected (extrapolated) new value, but only uses it if Sales[ind]
is NA
; that is, coalesce
returns (vector-wise in general, though here it is always length-1) the first non-NA
value of the arguments providedcoalesce
becomes the prevval
on the next iteration of the anon-function (which ind
now being incremented, since we used row_number()
.Reduce
gives you the last value of the iterations, but we want every step of the way, so we add accumulate=TRUE
;init=
, the return vector is actually length nrows+1
, and we know that the first is a throw-away (because we used init=
), so we remove it with [-1]
The purrr::accumulate
version, if you're curious:
sales.data %>%
group_by(Product, Region) %>%
mutate(
newcol = purrr::accumulate(row_number(),
~ coalesce(Sales[.y], .x * World.growth[.y]), .init=NA)[-1]
) %>%
ungroup()
Moving the row_number()
as the first argument, using .init=
as we did init=
, and keeping [-1]
as before.