Search code examples
rdplyrinterpolation

Interpolating NA values within group (long-format data)


I am trying to interpolate missing NA values within a data frame that is in "long-format," but am having trouble. Here is an example:

library(tidyverse)
library(zoo)

#step 1 - write data frame
df <- tibble(day=1:10,
             walmart =  c(1, 2, 3, 4, NA, NA, 7, 8 , 9, 10),
             amazon= c(10, 15, 20, NA, 30, 35, 40, NA, 50, 55))

#step 2 - convert to long format 
df <- df %>%
  pivot_longer(cols= c(walmart, amazon),
               names_to = "company",
               values_to = "sales")

#step 3 - perform interpolation
df <- df %>% 
  mutate(sales=na.approx(sales))

Evidently, the interpolation is not taking into account the two different stores. I would like to interpolate the missing values for walmart based on the surrounding values for walmart only, and the missing values for amazon based on the surrounding values for amazon only. I do recognize that performing the interpolation with the data in wide format (as it is after step 1) on each column (in this case walmart and amazon) is possible, but I am trying to avoid doing this with my actual data because a) there are thousands of "companies" and b) I want to interpolate several variables so by transposing the data I would just put myself back in the same position. Moreover, I don't thinking using group_by(company) will work because that yields a size error with na.approx. Is there a way to resolve this? Thank you very much.


Solution

  • Group by company (not sales):

    df %>%
      group_by(company) %>%
      mutate(sales=zoo::na.approx(sales)) %>%
      ungroup()
    # # A tibble: 20 x 3
    #      day company sales
    #    <int> <chr>   <dbl>
    #  1     1 walmart     1
    #  2     1 amazon     10
    #  3     2 walmart     2
    #  4     2 amazon     15
    #  5     3 walmart     3
    #  6     3 amazon     20
    #  7     4 walmart     4
    #  8     4 amazon     25
    #  9     5 walmart     5
    # 10     5 amazon     30
    # 11     6 walmart     6
    # 12     6 amazon     35
    # 13     7 walmart     7
    # 14     7 amazon     40
    # 15     8 walmart     8
    # 16     8 amazon     45
    # 17     9 walmart     9
    # 18     9 amazon     50
    # 19    10 walmart    10
    # 20    10 amazon     55
    

    Since you're running into errors in your use, that suggests that you have NAs on either end of a company's set of values. For instance,

    df$amazon[10] <- NA
    df
    # # A tibble: 10 x 3
    #      day walmart amazon
    #    <int>   <dbl>  <dbl>
    #  1     1       1     10
    #  2     2       2     15
    #  3     3       3     20
    #  4     4       4     NA
    #  5     5      NA     30
    #  6     6      NA     35
    #  7     7       7     40
    #  8     8       8     NA
    #  9     9       9     50
    # 10    10      10     NA
    df <- df %>%
      pivot_longer(cols= c(walmart, amazon),
                   names_to = "company",
                   values_to = "sales")
    df %>%
      group_by(company) %>%
      mutate(sales=zoo::na.approx(sales)) %>%
      ungroup()
    Error in `mutate()`:
    # ! Problem while computing `sales = zoo::na.approx(sales)`.
    # x `sales` must be size 10 or 1, not 9.
    # i The error occurred in group 1: company = "amazon".
    # Run `rlang::last_error()` to see where the error occurred.
    

    The issue with this is that zoo::na.approx is shortening your vector. Consider this:

    zoo::na.approx(c(1,NA,3))
    # [1] 1 2 3
    zoo::na.approx(c(1,NA,3,NA))
    # [1] 1 2 3
    zoo::na.approx(c(1,NA,3,NA), na.rm=FALSE)
    # [1]  1  2  3 NA
    

    So add na.rm=FALSE to your call to na.approx:

    df %>%
      group_by(company) %>%
      mutate(sales=zoo::na.approx(sales, na.rm = FALSE)) %>%
      ungroup()
    # A tibble: 20 x 3
    #      day company sales
    #    <int> <chr>   <dbl>
    #  1     1 walmart     1
    #  2     1 amazon     10
    #  3     2 walmart     2
    #  4     2 amazon     15
    #  5     3 walmart     3
    #  6     3 amazon     20
    #  7     4 walmart     4
    #  8     4 amazon     25
    #  9     5 walmart     5
    # 10     5 amazon     30
    # 11     6 walmart     6
    # 12     6 amazon     35
    # 13     7 walmart     7
    # 14     7 amazon     40
    # 15     8 walmart     8
    # 16     8 amazon     45
    # 17     9 walmart     9
    # 18     9 amazon     50
    # 19    10 walmart    10
    # 20    10 amazon     NA
    

    Note that since this is doing interpolation, the NA values outside of the known range will not be replaced. If you need that, consider what form of extrapolation you need.