Search code examples
rdataframedataformatlong-format-data

How to convert 'event' data to panel data?


I have data for a number of retail branches and the date at which they closed, if they didn't close then this is NA. I would like to expand these data so that they are a panel, with a 0/1 indicator to show if the branch was closed in that, or subsequent, years.

This is an example of the data format that I have and the data format that I would like. Here the data covers the period 2015 to 2019, 5 years. Branches A, B and D stay open, but branch C closes in 2016 and branch E in 2019.

branch <- LETTERS[1:5]
yearclosed <- c(NA, NA, 2016, NA, 2019)

have.df <- data.frame(branch, yearclosed)
have.df

branch <- c(rep("A",5), rep("B",5), rep("C",5), rep("D",5), rep("E",5))
year <- rep(2015:2019, 5)
closed <-c (rep(0,5), rep(0,5), 0,1,1,1,1, rep(0,5), 0,0,0,0,1)

want.df <- data.frame(branch, year, closed)
want.df

I have tried to experiment with converting from wide to long format but not made much progress. I could write a couple of for-loops, but these are generally not the best solution in R? Does anyone have any similar experiences they can share with me? Thanks.


Solution

  • Code from question to generate have.df

    branch <- LETTERS[1:5]
    yearclosed <- c(NA, NA, 2016, NA, 2019)
    have.df <- data.frame(branch, yearclosed)
    

    Load libraries

    library(dplyr)
    library(tidyr)
    

    Prepare have.df by triming of missings, renaming the yearclosed, and creating a new closed column that is 1 for every row.

    have.df <- 
      have.df |> 
      filter(!is.na(yearclosed)) |> 
      rename(year = yearclosed) |> 
      mutate(closed = 1)
    

    Using tidyr::expand_grid() we can create the first two columns of the data.frame you want and then join it with the modified have.df to obtain the result you are looking for. Using group_by() and fill() we can set all years after the initial closing year to 1. mutate() and coalesce() help us to set all missing values in closed to 0.

    expand_grid(branch = LETTERS[1:5], year = 2015:2019) |> 
      left_join(have.df) |> 
      group_by(branch) |> 
      fill(closed) |> 
      mutate(closed = coalesce(closed, 0)) |> 
      print(n = 25)
    #> Joining, by = c("branch", "year")
    #> # A tibble: 25 × 3
    #>    branch  year closed
    #>    <chr>  <dbl>  <dbl>
    #>  1 A       2015      0
    #>  2 A       2016      0
    #>  3 A       2017      0
    #>  4 A       2018      0
    #>  5 A       2019      0
    #>  6 B       2015      0
    #>  7 B       2016      0
    #>  8 B       2017      0
    #>  9 B       2018      0
    #> 10 B       2019      0
    #> 11 C       2015      0
    #> 12 C       2016      1
    #> 13 C       2017      1
    #> 14 C       2018      1
    #> 15 C       2019      1
    #> 16 D       2015      0
    #> 17 D       2016      0
    #> 18 D       2017      0
    #> 19 D       2018      0
    #> 20 D       2019      0
    #> 21 E       2015      0
    #> 22 E       2016      0
    #> 23 E       2017      0
    #> 24 E       2018      0
    #> 25 E       2019      1