Search code examples
rdplyrpanel-data

Creating panel data, filling gaps between years and repeating the last value in the subsequent years using R


I have the following sample: original data

I am trying to turn it into the following panel data: expected

As you can see in the last image, I would like to repeat the values between years for the same country, and repeat the last value for the subsequent years until the year 2020.


Solution

  • You can use grid.expand to get the country / year combinations you want, then left_join the main data frame to this, and finally fill the missing data, ensuring you filter out any remaining NAs.

    library(dplyr)
    library(tidyr)
    
    panel <- expand.grid(year = min(df$year):2020, 
                         country = unique(df$country), 
                         stringsAsFactors = FALSE)     %>%
               left_join(df)                           %>% 
               group_by(country)                       %>% 
               fill(c("id", "regioncode", "prespowl")) %>%
               filter(!is.na(id))                      %>%
               as.data.frame()
    

    Which gives the following result:

    panel
    #>    year   country id regioncode  prespowl
    #> 1  2011   Albania  1     Europe 0.1817557
    #> 2  2012   Albania  1     Europe 0.1817557
    #> 3  2013   Albania  1     Europe 0.1817557
    #> 4  2014   Albania  1     Europe 0.1817557
    #> 5  2015   Albania  1     Europe 0.1817557
    #> 6  2016   Albania  1     Europe 0.1817557
    #> 7  2017   Albania  1     Europe 0.1817557
    #> 8  2018   Albania  1     Europe 0.1411482
    #> 9  2019   Albania  1     Europe 0.1411482
    #> 10 2020   Albania  1     Europe 0.1411482
    #> 11 2016   Algeria  2     Africa 0.3837466
    #> 12 2017   Algeria  2     Africa 0.3837466
    #> 13 2018   Algeria  2     Africa 0.4837466
    #> 14 2019   Algeria  2     Africa 0.4837466
    #> 15 2020   Algeria  2     Africa 0.4837466
    #> 16 1999 Argentina  3   Americas 0.2887138
    #> 17 2000 Argentina  3   Americas 0.2887138
    #> 18 2001 Argentina  3   Americas 0.2887138
    #> 19 2002 Argentina  3   Americas 0.2887138
    #> 20 2003 Argentina  3   Americas 0.2887138
    #> 21 2004 Argentina  3   Americas 0.2887138
    #> 22 2005 Argentina  3   Americas 0.2887138
    #> 23 2006 Argentina  3   Americas 0.4322523
    #> 24 2007 Argentina  3   Americas 0.4322523
    #> 25 2008 Argentina  3   Americas 0.4322523
    #> 26 2009 Argentina  3   Americas 0.4322523
    #> 27 2010 Argentina  3   Americas 0.4322523
    #> 28 2011 Argentina  3   Americas 0.4322523
    #> 29 2012 Argentina  3   Americas 0.4322523
    #> 30 2013 Argentina  3   Americas 0.5453171
    #> 31 2014 Argentina  3   Americas 0.5453171
    #> 32 2015 Argentina  3   Americas 0.5453171
    #> 33 2016 Argentina  3   Americas 0.5453171
    #> 34 2017 Argentina  3   Americas 0.5453171
    #> 35 2018 Argentina  3   Americas 0.5453171
    #> 36 2019 Argentina  3   Americas 0.5453171
    #> 37 2020 Argentina  3   Americas 0.5453171
    

    Data used:

    df <- read.table(text= 'country year    id  regioncode   prespowl
                            Albania 2011    1   Europe      0.1817557
                            Albania 2018    1   Europe      0.1411482
                            Algeria 2016    2   Africa      0.3837466
                            Algeria 2018    2   Africa      0.4837466
                          Argentina 1999    3   Americas    0.2887138
                          Argentina 2006    3   Americas    0.4322523
                          Argentina 2013    3   Americas    0.5453171
    ', header = TRUE, stringsAsFactors = FALSE)