Search code examples
rpanel-data

create panel data in R (cross section dimension holds repeated entities)


I have data that look like the following:

>loan data
           ID      loan_start_date    loan_maturity_date  feb13  march13 april13........
            1      2016-01-03         2017-01-03          46       45     44
            1      2011-01-08         2013-01-08          NA       NA     NA   
            1      2013-02-13         2015-02-13          23       22     21
            2      2012-02-03         2016-05-03          38       37     36
            2      2013-05-08         2014-01-09          10       09     08   
            2      2011-03-13         2013-02-18          0        NA     NA
            3      2015-07-03         2016-01-08          34       33     32
            3      2013-01-09         2015-07-08          28       27     26   

Can I create panel data from this? If yes, how can I do that in R? In the panel data, the cross-section dimension is ID and time dimension is feb13,march13,april13 (time to loan maturity date from that specific month).....goes on for 48 months. I have looked at other examples of creating panel data but in those examples, each ID just took one row, now each ID is sitting in multiple rows. So I am confused how to create panel out of this. Would really appreciate your help.

Edit: The expected outcome should look like this if I am correct:

>loan data
           ID      months    time to maturity  
            1      feb13         46          
            1      march13       45            
            1      april13       44          
                      .
                      .
                      .
            1      jan17          0          
            1      feb13         NA          
            1      march13       NA            
            1      april13       NA          
                      .
                      .
                      .
            1      jan17         NA          
            1      feb13         23          
            1      march13       22             
            1      april13       21          
                      .
                      .
                      .
            1      jan17         NA        
            2      feb13         38          
            2      march13       37            
            2      april13       36         
                     .
                     .
                     .
            2      jan17         NA 
            2      feb13         10          
            2      march13       09            
            2      april13       08         
                     .
                     .
                     .
            2      jan17         NA 
            2      feb13         0          
            2      march13       NA            
            2      april13       NA         
                     .
                     .
                     .
            2      jan17         NA 
            2      feb13         0          
            2      march13       NA            
            2      april13       NA         
                     .
                     .
                     .
            2      jan17         NA 
            3      feb13         34          
            3      march13       33            
            3      april13       32         
                     .
                     .
                     .
            3      jan17         NA 
            3      feb13         28          
            3      march13       27            
            3      april13       26         
                     .
                     .
                     .
            3      jan17         NA 

Solution

  • Another option is to use the gather() function from the tidyr package.

    This function turns multiple columns into a single key-value pair. You need to specify the data frame, the name of the new "key" column, the name of the new "value" column, and then specify which columns to gather. If (like in this case) there are more columns to include than exclude, you can simply specify which columns to exclude instead:

    library(tidyr)
    
    gather(data, key = "month_year", value = "months_to_maturity", -(ID:loan_maturity_date))
    
    #>    ID start_date loan_maturity_date month_year months_to_maturity
    #> 1   1 2016-01-03         2017-01-03      feb13                 46
    #> 2   1 2011-01-08         2013-01-08      feb13                 NA
    #> 3   1 2013-02-13         2015-02-13      feb13                 23
    #> 4   2 2012-02-03         2016-05-03      feb13                 38
    #> 5   2 2013-05-08         2014-01-09      feb13                 10
    #> 6   2 2011-03-13         2013-02-18      feb13                  0
    #> 7   3 2015-07-03         2016-01-08      feb13                 34
    #> 8   3 2013-01-09         2015-07-08      feb13                 28
    #> 9   1 2016-01-03         2017-01-03    march13                 45
    #> 10  1 2011-01-08         2013-01-08    march13                 NA
    #> 11  1 2013-02-13         2015-02-13    march13                 22
    #> 12  2 2012-02-03         2016-05-03    march13                 37
    #> 13  2 2013-05-08         2014-01-09    march13                  9
    #> 14  2 2011-03-13         2013-02-18    march13                 NA
    #> 15  3 2015-07-03         2016-01-08    march13                 33
    #> 16  3 2013-01-09         2015-07-08    march13                 27
    #> 17  1 2016-01-03         2017-01-03    april13                 44
    #> 18  1 2011-01-08         2013-01-08    april13                 NA
    #> 19  1 2013-02-13         2015-02-13    april13                 21
    #> 20  2 2012-02-03         2016-05-03    april13                 36
    #> 21  2 2013-05-08         2014-01-09    april13                  8
    #> 22  2 2011-03-13         2013-02-18    april13                 NA
    #> 23  3 2015-07-03         2016-01-08    april13                 32
    #> 24  3 2013-01-09         2015-07-08    april13                 26
    

    And the data used for this:

    df <- 
      data.frame(ID                 = c(1,1,1,2,2,2,3,3),
                 start_date         = c("2016-01-03",
                                        "2011-01-08",
                                        "2013-02-13",
                                        "2012-02-03",
                                        "2013-05-08",
                                        "2011-03-13",
                                        "2015-07-03",
                                        "2013-01-09"),
                 loan_maturity_date = c("2017-01-03",
                                        "2013-01-08",
                                        "2015-02-13",
                                        "2016-05-03",
                                        "2014-01-09",
                                        "2013-02-18",
                                        "2016-01-08",
                                        "2015-07-08"),
                 feb13              = c(46,
                                        NA,
                                        23,
                                        38,
                                        10,
                                        0 ,
                                        34,
                                        28),
                 march13            = c(45,
                                        NA,
                                        22,
                                        37,
                                        09,
                                        NA,
                                        33,
                                        27),
                 april13            = c(44,
                                        NA,
                                        21,
                                        36,
                                        08,
                                        NA,
                                        32,
                                        26))