Search code examples
rdplyrpanelextrapolation

Extrapolating in R using growth rates from another column (using dplyr)


I am trying to extrapolate forwards for missing values after the last non-missing value in a series using the growth rate given in another column.

This is a toy example below. My data is grouped according to product and region and should be arranged by year within grouping. I have sales data with some NA values. I am interested in using the World.growth rate to extrapolate the NA values for sales which appear after the last non-missing value in each series. So say for example Product = 1, Region = 1, I want the Interpolated.sales for 1973 to be 1972 value * World.growth value given in 1973.

Here is my data:

> library(openxlsx)
> library(dplyr)
> 
> sales.data <- read.xlsx("\\Interpolate_sales.xlsx", sheet = 1)
> 
> print(sales.data)
   Product Region Year      Sales World.growth Extrapolated.sales
1        1      1 1970         NA         0.88                 NA
2        1      1 1971 141.906913         0.72         141.906913
3        1      1 1972   9.605398         1.01           9.605398
4        1      1 1973         NA         1.18          11.334370
5        1      1 1974         NA         0.84           9.520871
6        1      2 1970  60.062486         0.88          60.062486
7        1      2 1971 124.904150         0.72         124.904150
8        1      2 1972         NA         1.01         126.153191
9        1      2 1973         NA         1.18         148.860765
10       1      2 1974         NA         0.84         125.043043
11       1      2 1975         NA         1.23         153.802943
12       1      3 1970  63.298780         0.88          63.298780
13       1      3 1971  90.219126         0.72          90.219126
14       1      3 1972 107.271043         1.01         107.271043
15       1      3 1973 129.122561         1.18         129.122561
16       1      3 1974         NA         0.84         108.462951

This is my dplyr attempt (Is there perhaps a better way by writing a function or something? I guess this will also be handy if I want to backwards extrapolate missing values at the start of a series.)

> sales.data.extrapolated <- sales.data %>%
+   group_by(Product, Region) %>%
+   arrange(Product, Region, Year) %>%
+   mutate(Extrapolated.sales.manual = case_when(
+     row_number() > max(which(!is.na(Sales))) ~ lag(Sales) * World.growth,
+     TRUE ~ Sales
+   ))
> 
> print(sales.data.extrapolated)
# A tibble: 16 × 7
# Groups:   Product, Region [3]
   Product Region  Year  Sales World.growth Extrapolated.sales Extrapolated.sales.manual
     <dbl>  <dbl> <dbl>  <dbl>        <dbl>              <dbl>                     <dbl>
 1       1      1  1970  NA            0.88              NA                        NA   
 2       1      1  1971 142.           0.72             142.                      142.  
 3       1      1  1972   9.61         1.01               9.61                      9.61
 4       1      1  1973  NA            1.18              11.3                      11.3 
 5       1      1  1974  NA            0.84               9.52                     NA   
 6       1      2  1970  60.1          0.88              60.1                      60.1 
 7       1      2  1971 125.           0.72             125.                      125.  
 8       1      2  1972  NA            1.01             126.                      126.  
 9       1      2  1973  NA            1.18             149.                       NA   
10       1      2  1974  NA            0.84             125.                       NA   
11       1      2  1975  NA            1.23             154.                       NA   
12       1      3  1970  63.3          0.88              63.3                      63.3 
13       1      3  1971  90.2          0.72              90.2                      90.2 
14       1      3  1972 107.           1.01             107.                      107.  
15       1      3  1973 129.           1.18             129.                      129.  
16       1      3  1974  NA            0.84             108.                      108.  

As you can see, this only calculates the value for the first missing value and none beyond.


Solution

  • This is a "reduction" problem and must be done step-wise, since "this row's" new value depends on the results of "last row's" value ... the normal vectorized operations in R don't know the updated value for the previous value until after the whole vector has been calculated.

    For this, I'll use Reduce, though purrr::accumulate is nearly a drop-in replacement if you're already using purrr.

    sales.data %>%
      group_by(Product, Region) %>%
      mutate(
        newcol = Reduce(function(prevval, ind) coalesce(Sales[ind], prevval * World.growth[ind]),
                        row_number(), init = NA, accumulate = TRUE)[-1]
      ) %>%
      ungroup()
    # # A tibble: 16 × 7
    #    Product Region  Year  Sales World.growth Extrapolated.sales newcol
    #      <int>  <int> <int>  <dbl>        <dbl>              <dbl>  <dbl>
    #  1       1      1  1970  NA            0.88              NA     NA   
    #  2       1      1  1971 142.           0.72             142.   142.  
    #  3       1      1  1972   9.61         1.01               9.61   9.61
    #  4       1      1  1973  NA            1.18              11.3   11.3 
    #  5       1      1  1974  NA            0.84               9.52   9.52
    #  6       1      2  1970  60.1          0.88              60.1   60.1 
    #  7       1      2  1971 125.           0.72             125.   125.  
    #  8       1      2  1972  NA            1.01             126.   126.  
    #  9       1      2  1973  NA            1.18             149.   149.  
    # 10       1      2  1974  NA            0.84             125.   125.  
    # 11       1      2  1975  NA            1.23             154.   154.  
    # 12       1      3  1970  63.3          0.88              63.3   63.3 
    # 13       1      3  1971  90.2          0.72              90.2   90.2 
    # 14       1      3  1972 107.           1.01             107.   107.  
    # 15       1      3  1973 129.           1.18             129.   129.  
    # 16       1      3  1974  NA            0.84             108.   108.  
    

    Walkthrough:

    • Reduce is called either nrows-1 times (if init= is not provided) or nrows if it is provided;
    • the first time it is called, prevval is set to the init= value we provided (NA) and ind is 1 (the first from row_number());
      • if you're curious, had we omitted init= completely, then on the first iteration, prevval would be 1 and ind would be 2, the first two values in row_number(); not what we want
      • the intent when using this is that prevval is the previous row's Sales, and ind is the row number ... they are different "types" of numbers, not to be intermingled, which is why having prevval=1, ind=2 was a problem
    • the coalesce(..) calculates the expected (extrapolated) new value, but only uses it if Sales[ind] is NA; that is, coalesce returns (vector-wise in general, though here it is always length-1) the first non-NA value of the arguments provided
    • the return value of that coalesce becomes the prevval on the next iteration of the anon-function (which ind now being incremented, since we used row_number().
    • normally, Reduce gives you the last value of the iterations, but we want every step of the way, so we add accumulate=TRUE;
    • because we used init=, the return vector is actually length nrows+1, and we know that the first is a throw-away (because we used init=), so we remove it with [-1]

    The purrr::accumulate version, if you're curious:

    sales.data %>%
      group_by(Product, Region) %>%
      mutate(
        newcol = purrr::accumulate(row_number(),
            ~ coalesce(Sales[.y], .x * World.growth[.y]), .init=NA)[-1]
      ) %>%
      ungroup()
    

    Moving the row_number() as the first argument, using .init= as we did init=, and keeping [-1] as before.