Search code examples
rdatetimetransformationdummy-variable

How to change from event data to time-series cross-sectional data with event dummies?


I have been given a dataset that lists dates for specific events by country in the manner below.

country date1 date2   
1    03/01/2012    05/01/2012
2    05/04/2012    12/10/2012
3    07/12/2012    20/03/2012
4    04/02/2012    24/12/2012

What I need to do with this data is create panel data for country/year/month/day levels. I would want to create a dummy variable for each event.

country year month   day
1  2012    01    01
1  2012    01    02
1  2012    01    03
1  2012    01    04
1  2012    01    05
1  2012    01    06

The end result of this would look like this with each country panel having either a 0 or a 1 for every year/month/day in each separate event variable.

country year month day event1 event2 
1  2012    01    01    0    0 
1  2012    01    02    0    0
1  2012    01    03    1    0
1  2012    01    04    1    0
1  2012    01    05    1    1
1  2012    01    06    1    1

The question is how to most efficiently get from the data that I do have to the data structure that I need. I found a previous question that had a similar question (Dummy Variable by date.), but this question did not deal with panel data.


Solution

  • Here is a tidyverse solution. The idea is to use tidyr::complete to generate the full set of date-country combinations that you want. Then it's easy to use tidyr::spread to split the values of has_event into separate columns for each event and create the right column names with the sep argument. The rest is just cleaning up - converting the date into separate year, month, day columns, removing extraneous columns, and replacing the NA with 0 in the event columns. This should be robust to more countries, more events per country, or large date ranges.

    library(tidyverse)
    library(lubridate)
    #> 
    #> Attaching package: 'lubridate'
    #> The following object is masked from 'package:base':
    #> 
    #>     date
    tbl <- read_table2(
      "country date1 date2
      1    03/01/2012    05/01/2012
      2    05/04/2012    12/10/2012
      3    07/12/2012    20/03/2012
      4    04/02/2012    24/12/2012"
    )
    #> Warning in rbind(names(probs), probs_f): number of columns of result is not
    #> a multiple of vector length (arg 2)
    #> Warning: 1 parsing failure.
    #> row # A tibble: 1 x 5 col     row col   expected actual        file         expected   <int> <chr> <chr>    <chr>         <chr>        actual 1     4 date2 ""       embedded null literal data file # A tibble: 1 x 5
    
    tbl %>%
      gather(event, date, date1:date2) %>%
      mutate(date = dmy(date)) %>%
      complete(country, date = seq.Date(min(date), max(date), 1)) %>%
      mutate(
        event = str_remove_all(event, "date"),
        has_event = ifelse(is.na(event), 0, 1)
        ) %>%
      spread(event, has_event, sep = "") %>%
      mutate_at(vars(event1:event2), replace_na, 0) %>%
      mutate(
        year = year(date),
        month = month(date),
        day = day(date)
      ) %>%
      select(country, year:day, event1:event2)
    #> # A tibble: 1,428 x 6
    #>    country  year month   day event1 event2
    #>      <int> <dbl> <dbl> <int>  <dbl>  <dbl>
    #>  1       1 2012.    1.     3     1.     0.
    #>  2       1 2012.    1.     4     0.     0.
    #>  3       1 2012.    1.     5     0.     1.
    #>  4       1 2012.    1.     6     0.     0.
    #>  5       1 2012.    1.     7     0.     0.
    #>  6       1 2012.    1.     8     0.     0.
    #>  7       1 2012.    1.     9     0.     0.
    #>  8       1 2012.    1.    10     0.     0.
    #>  9       1 2012.    1.    11     0.     0.
    #> 10       1 2012.    1.    12     0.     0.
    #> # ... with 1,418 more rows
    

    Created on 2018-03-21 by the reprex package (v0.2.0).