Search code examples
rpanel

Create a panel data frame in R - where one of the values within a variable changes through time


Using the following data:

id <- c(10420,10740, 12060,12060, 12060)
year <- c(2004, 2002, 2003, 2003, 2006)
y <- c(1,1,1,1,1)
data <- data.frame(id, year, y)
data



 id year y
1 10420 2004 1
2 10740 2002 1
3 12060 2003 1
4 12060 2003 1
5 12060 2006 1

I would like to create a panel data with the following structure:

datapasta::tribble_paste(paneldata)     
  
 tibble::tribble(
~year,   ~id,    ~y,
 2002, 10420,      0,
 2003, 10420,      0,
 2004, 10420,      1,
 2005, 10420,      1,
 2006, 10420,      1,
 2007, 10420,      1,
 2002, 10740,      1,
 2003, 10740,      1,
 2004, 10740,      1,
 2005, 10740,      1,
 2006, 10740,      1,
 2007, 10740,      1,
 2002, 12060,      0,
 2003, 12060,      2,
 2004, 12060,      2,
 2005, 12060,      2,
 2006, 12060,      3,
 2007, 12060,      3
)

I have tried to following code to create the panel dataset above.

g <- with(data, expand.grid(year = seq(2002,2007), id= unique(id),y=0)) 
data2 <- rbind(data, g)  
data3 <-aggregate(y ~ year + id, data2, sum)[c("id", "year", "y")]

However, the code above does not create the panel data as if fails to do the following: For a particular id & year, the values within the variable y should change if a new observation is present. For example, the values for y with id 12060 will be zero for year 2002 then it will change to a value of 2 in the year 2003 and 2005, and then it will change to a value of 3 for years 2006 to 2007.

Any and all thoughts would be greatly appreciated.


Solution

  • We expand the rows with complete (after doing a group by 'id') by getting the sequence of min to max + 1 of 'year' for entire data and get the cumulative sum of 'y' after grouping by 'id'

    library(dplyr)
    library(tidyr)   
    data %>% 
        group_by(id, year) %>% 
        mutate(y = cumsum(y)) %>% 
        group_by(id) %>% 
        filter(!duplicated(year, fromLast = TRUE)) %>% 
        complete(year = min(.$year):(max(.$year) + 1), fill = list(y = 0)) %>% 
        mutate(y = cumsum(y))
    # A tibble: 18 x 3
    # Groups:   id [3]
    #      id  year     y
    #   <dbl> <dbl> <dbl>
    # 1 10420  2002     0
    # 2 10420  2003     0
    # 3 10420  2004     1
    # 4 10420  2005     1
    # 5 10420  2006     1
    # 6 10420  2007     1
    # 7 10740  2002     1
    # 8 10740  2003     1
    # 9 10740  2004     1
    #10 10740  2005     1
    #11 10740  2006     1
    #12 10740  2007     1
    #13 12060  2002     0
    #14 12060  2003     2
    #15 12060  2004     2
    #16 12060  2005     2
    #17 12060  2006     3
    #18 12060  2007     3