Search code examples
rdata.tablepanel-data

Calculate difference of variable values from different rows based on combination of other variables' values from different rows in R


I have unbalanced panel data identified by id and period in a data.table. There are 8,278 observations and 230 variables.

I want to find out how long it takes the companies (identified by id) in my data from planning to enter a market (plan_entry == "yes", contains NAs) until they actually enter the market (enter_market == "yes", contains no NAs).

Hence, I want to generate e.g. time_to_entry == 5 if in period == 4 a company plans to enter a market and in period == 9 finally enters the market. The data are roughly structured as follows and already include the desired output variable. Note, that companies might enter a market without stating any previous plans to do so. It might also be that they have plans to enter a new market and enter a new market in the same period. In both cases I want time_to_entry == 0. It should also be 0 if a company never enters any market.

Exemplary data and desired outcome

library(data.table)

desired_output <- 
  data.table(id = as.factor(c(rep("C001", 3), "C002", rep("C003", 5), rep("C004", 2), rep("C005", 7))),
             period = as.factor(c(1, 2, 3, 2, 1, 4, 5, 6, 10, 3, 4, 2, 3, 4, 7, 8, 9, 10)),
             plan_entry = as.factor(c(rep(NA, 2), "yes", "no", NA, rep("no", 2), rep("yes", 4), rep(NA, 2), rep("yes", 4), "no")),
             enter_market = as.factor(c(rep("no", 3), "yes", rep("no", 5), rep("yes", 2), rep("no", 5), rep("yes", 2))),
             time_to_entry = c(rep(0, 10), 1, rep(0, 5), 5, 1))

desired_output
#      id period plan_entry enter_market time_to_entry
# 1: C001      1       <NA>           no             0
# 2: C001      2       <NA>           no             0
# 3: C001      3        yes           no             0
# 4: C002      2         no          yes             0
# 5: C003      1       <NA>           no             0
# 6: C003      4         no           no             0
# 7: C003      5         no           no             0
# 8: C003      6        yes           no             0
# 9: C003     10        yes           no             0
#10: C004      3        yes          yes             0     ! there might be cases 
# where companies enter a market without stating any plans to do so in previous periods
#11: C004      4        yes          yes             1 
#12: C005      2       <NA>           no             0
#13: C005      3       <NA>           no             0
#14: C005      4        yes           no             0
#15: C005      7        yes           no             0
#16: C005      8        yes           no             0
#17: C005      9        yes          yes             5
#18: C005     10         no          yes             1       

Problem description

So, I need a command that looks for the first period where plan_entry == "yes" for a specific id, then searches in the following periods for enter_market == "yes" and calculates the difference between the respective periods and stores it in time_to_entry. Then, it should start from this period and look for the next plan_entry == "yes" for a specific id (This could be in the same period where a company enters a market. However, this case should not be considered but only the one in the next period where enter_market == "yes".) etc.

Does anyone have an idea how to do this?

Incomplete step-by-step approach

In the following I tried another approach but it is not considering all of the requirements since it only considers the first time a company entered a market. I'd also be very happy to learn about a data.table approach.

  1. Search for minimum period where dt$enter_market == "yes" for each company
library(data.table)
library(dplyr)

# generate almost same dataset but without desired variable time_to_entry
dt <- 
  data.table(id = as.factor(c(rep("C001", 3), "C002", rep("C003", 5), rep("C004", 2), rep("C005", 7))),
             period = as.factor(c(1, 2, 3, 2, 1, 4, 5, 6, 10, 3, 4, 2, 3, 4, 7, 8, 9, 10)),
             plan_entry = as.factor(c(rep(NA, 2), "yes", "no", NA, rep("no", 2), rep("yes", 4), rep(NA, 2), rep("yes", 4), "no")),
             enter_market = as.factor(c(rep("no", 3), "yes", rep("no", 5), rep("yes", 2), rep("no", 5), rep("yes", 2))))


# generate minimum period by company
dt[, min_period := min(as.numeric(period)), by = id] 

# make data.table a data.frame
dt <- as.data.frame(dt)

# use dplyr
dt <-
  dt %>%
  group_by(id, enter_market) %>% # group data by id and market entry
  mutate(min_entry_period =  min(as.numeric(period))) # generate minimum period for grouped data

# minimum period for companies where plan_entry == "yes"
dt$entry_period <-
  ifelse(
    dt$min_period != dt$min_entry_period & dt$plan_entry == "yes",
    dt$min_entry_period,
    NA)

dt
# A tibble: 18 x 8
# Groups:   id, enter_market [6]
#   id    period plan_entry enter_market time_to_entry min_period min_entry_period entry_period
#   <fct> <fct>  <fct>      <fct>                <dbl>      <dbl>            <dbl>        <dbl>
# 1 C001  1      NA         no                       0          1                1           NA
# 2 C001  2      NA         no                       0          1                1           NA
# 3 C001  3      yes        no                       0          1                1           NA
# 4 C002  2      no         yes                      0          2                2           NA
# 5 C003  1      NA         no                       0          1                1           NA
# 6 C003  4      no         no                       0          1                1           NA
# 7 C003  5      no         no                       0          1                1           NA
# 8 C003  6      yes        no                       0          1                1           NA
# 9 C003  10     yes        no                       0          1                1           NA
#10 C004  3      yes        yes                      0          3                3           NA
#11 C004  4      yes        yes                      1          3                3           NA
#12 C005  2      NA         no                       0          2                2           NA
#13 C005  3      NA         no                       0          2                2           NA
#14 C005  4      yes        no                       0          2                2           NA
#15 C005  7      yes        no                       0          2                2           NA
#16 C005  8      yes        no                       0          2                2           NA
#17 C005  9      yes        yes                      5          2                9            9
#18 C005  10     no         yes                      0          2                9           NA
  1. All periods where dt$plan_entry == "yes"
dt$plan_entry_period <-
  ifelse(dt$plan_entry == "yes", dt$period, NA)

  1. For every company where dt$plan_entry == "yes" calculate entry_period – plan_entry_period
# fill in first entry_period for each observation by company
library(zoo) # for na.locf()

dt <- as.data.table(dt)
dt[, entry_period := na.locf(entry_period, na.rm = FALSE, fromLast = FALSE), by = id]
dt[, entry_period := na.locf(entry_period, na.rm = FALSE, fromLast = TRUE), by = id]

dt$time_to_entry <-
  ifelse(
    dt$plan_entry == "yes", 
    dt$entry_period - dt$plan_entry_period,
    NA)

# check variable
summary(dt$time_to_entry)

dt
#      id period plan_entry enter_market min_period min_entry_period entry_period plan_entry_period time_to_entry
# 1: C001      1       <NA>           no          1                1           NA                NA            NA
# 2: C001      2       <NA>           no          1                1           NA                NA            NA
# 3: C001      3        yes           no          1                1           NA                 3            NA
# 4: C002      2         no          yes          2                2           NA                NA             0
# 5: C003      1       <NA>           no          1                1           NA                NA            NA
# 6: C003      4         no           no          1                1           NA                NA             0
# 7: C003      5         no           no          1                1           NA                NA             0
# 8: C003      6        yes           no          1                1           NA                 6            NA
# 9: C003     10        yes           no          1                1           NA                10            NA
#10: C004      3        yes          yes          3                3           NA                 3            NA
#11: C004      4        yes          yes          3                3           NA                 4            NA
#12: C005      2       <NA>           no          2                2            9                NA            NA
#13: C005      3       <NA>           no          2                2            9                NA            NA
#14: C005      4        yes           no          2                2            9                 4             5
#15: C005      7        yes           no          2                2            9                 7             2
#16: C005      8        yes           no          2                2            9                 8             1
#17: C005      9        yes          yes          2                9            9                 9             0
#18: C005     10         no          yes          2                9            9                NA             0

Obviously, time_to_entry looks very different compared to the desired_result dataset.

I hope I was able to describe the problem well enough. I'd really appreciate any advice! Thanks in advance!


Solution

  • Here is an option using non-equi join:

    #find the previous latest enter_market before current row
    DT[enter_market=="yes", prev_entry := 
        fcoalesce(.SD[.SD, on=.(id, period<period), mult="last", x.period], 0L)
    ]
    
    #non-equi join to find the first plan_entry before current enter_market but after previous latest enter_market
    DT[enter_market=="yes", plan_period :=
        DT[plan_entry=="yes"][.SD, on=.(id, period>=prev_entry, period<period), mult="first", x.period]
    ]
    
    #calculate time_to_entry and set NAs to 0
    DT[, time_to_entry := fcoalesce(period - plan_period, 0L)]
    
    DT
    

    data (without factors):

    DT <- 
        data.table(id = c(rep("C001", 3), "C002", rep("C003", 5), rep("C004", 2), rep("C005", 7)),
            period = as.integer(c(1, 2, 3, 2, 1, 4, 5, 6, 10, 3, 4, 2, 3, 4, 7, 8, 9, 10)),
            plan_entry = c(rep(NA, 2), "yes", "no", NA, rep("no", 2), rep("yes", 4), rep(NA, 2), rep("yes", 4), "no"),
            enter_market = c(rep("no", 3), "yes", rep("no", 5), rep("yes", 2), rep("no", 5), rep("yes", 2)))