r events panel time-series dummy-data

How to insert "event dummy" (?) based on time interval in different dataframe


I have a data frame that gives the start and end year of an event for different countries. An event can occur several times for the same country (e.g. country A).

Start.Year <- c("1990","1992","1997","1995")
End.Year <- c("1995","1993","2000","1996")
Country <- c("A","B","A","C")
a <- data.frame(Start.Year,End.Year,Country)
a$Start.Year <- as.numeric(as.character(a$Start.Year))
a$End.Year <- as.numeric(as.character(a$End.Year))

      Start.Year End.Year Country
       1990     1995       A
       1992     1993       B
       1997     2000       A
       1995     1996       C 

I have a second data frame in time-series cross-section format (Year / Country / Event (yes/no)).

b1 <- as.data.frame(expand.grid(year=(1990:2000), Country=unique(a$Country)))
b1$Event <- 0
b1$year <- as.numeric(as.character(b1$year))

How can I obtain the result below (apologies for the clumsy presentation)? Event should be 1 when the year lies between the start and end year given in the first data frame, separately for each country. The second data frame already exists, so I don't want to reshape the first data frame; I want to match (?) the information from the first data frame onto the second one.

I tried

b1$Event[a$Start.Year<=b1$year & a$End.Year>=b1$year] <- 1 

but I get the message "longer object length is not a multiple of shorter object length". I would be grateful for any hint or advice!

Result aimed at:

  Year Country Event
  1990       A     1
  1991       A     1
  1992       A     1
  1993       A     1
  1994       A     1
  1995       A     1
  1996       A     0
  1997       A     1
  1998       A     1
  1999       A     1
  2000       A     1
  1990       B     0
  1991       B     0
  1992       B     1
  1993       B     1
  1994       B     0
  1995       B     0
  1996       B     0
  1997       B     0
  1998       B     0
  1999       B     0
  2000       B     0
  1990       C     0
  1991       C     0
  1992       C     0
  1993       C     0
  1994       C     0
  1995       C     1
  1996       C     1
  1997       C     0
  1998       C     0
  1999       C     0
  2000       C     0

Solution

  • Here is a solution using the rolling join feature in data.table. I have slightly changed your definition of a (the years are created as numbers directly) and removed the Event column from b1, since it is rebuilt from the join result.

    library(data.table)  ## stops with an error if the package is missing
    
    Start.Year <- c(1990, 1992, 1997, 1995)
    End.Year <- c(1995, 1993, 2000, 1996)
    Country <- c("A", "B", "A", "C")
    a <- data.frame(Start.Year, End.Year, Country)
    a <- data.table(a)   ## convert to use feature
    
    b1 <- as.data.frame(expand.grid(year=(1990:2000), Country=unique(a$Country)))
    b1 <- data.table(b1)  ## convert
    
    ##  join by Start.Year, setting matching keys for each dataset
    setkey(a, Country, Start.Year)
    setkey(b1, Country, year)
    
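    # the tricky part
    # roll=TRUE matches each year in b1 to the most recent
    # event Start.Year at or before it (within the same Country)
    ab <- a[b1, roll=TRUE]
    
    # after the join, Start.Year holds the years taken from b1;
    # rename by name so the result does not depend on column order
    setnames(ab, c('Start.Year', 'End.Year'), c('Year', 'Event'))
    setcolorder(ab, c('Year', 'Country', 'Event'))
    ab[Year > Event, Event:=NA]  ## year is past the event's End.Year
    ab[!is.na(Event), Event:=1]  ## transform year markers to 1
    ab[is.na(Event), Event:=0]   ## transform missing matches to 0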
    

    ab is the data in the format you want. You can use it just like a data.frame or convert it back if you don't want to keep it in that class. The join should be very fast.
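
For comparison, here is a minimal base R sketch of the same idea without data.table. It keeps b1 as it is and fills in Event one event row at a time, which avoids comparing the 4-row columns of a against the 33-row columns of b1 directly (that length mismatch is what triggers the "longer object length" message in the question). The inputs are rebuilt as in the question, with stringsAsFactors = FALSE as an assumption to keep Country a plain character column; the name hit is only used for illustration.

```r
# Rebuild the question's inputs so the sketch is self-contained.
a <- data.frame(
  Start.Year = c(1990, 1992, 1997, 1995),
  End.Year   = c(1995, 1993, 2000, 1996),
  Country    = c("A", "B", "A", "C"),
  stringsAsFactors = FALSE
)
b1 <- expand.grid(year = 1990:2000, Country = unique(a$Country),
                  stringsAsFactors = FALSE)
b1$Event <- 0

# For each event in a, flag the rows of b1 for the same country
# whose year lies within [Start.Year, End.Year].
for (i in seq_len(nrow(a))) {
  hit <- b1$Country == a$Country[i] &
    b1$year >= a$Start.Year[i] &
    b1$year <= a$End.Year[i]
  b1$Event[hit] <- 1
}
```

The loop runs once per event rather than once per country-year, so it should be fine for a few thousand events; for much larger inputs the keyed join above is likely the better choice.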