Search code examples
rmergedata.tabledcast

Dcast/merge based on a column, with a value within a certain range


I have a panel dataset: panel and a dataset with a list of events: Events. For the panel dataset, an equal panelID shows that two observations belong together.

panelID = c(1:50)   
year= c(2001:2010)
country = c("NLD", "GRC", "GBR")

n <- 2

library(data.table)
set.seed(123)
Panel <- data.table(panelID = rep(sample(panelID), each = n),
                 country = rep(sample(country, length(panelID), replace = T), each = n),
                 year = c(replicate(length(panelID), sample(year, n))),
                 some_NA = sample(0:5, 6),                                             
                 some_NA_factor = sample(0:5, 6),         
                 norm = round(runif(100)/10,2),
                 Income = round(rnorm(10,-5,5),2),
                 Happiness = sample(10,10),
                 Sex = round(rnorm(10,0.75,0.3),2),
                 Age = sample(100,100),
                 Educ = round(rnorm(10,0.75,0.3),2))        
Panel[, uniqueID := .I]                                                                        # Creates a unique ID     
Panel[Panel == 0] <- NA    

Events <- fread(
"Event_Type  country year   
A   NLD   2005
A   NLD   2004       
A   GBR   2006
A   GBR   2003   
A   GRC   2002             
A   GRC   2007",
header = TRUE)

I want to know how often Events happen between the panel obervations, split out per year. As an example, for the panel observations with panelID == 2, in the country NLD there are two events, in or in between the years of that panel observation, namely in 2004 and 2005. Hence:

DESIRED OUTPUT:

panleID country year 2002  2003  2004 2005 2006 2007 
2       NLD     2004 NA    NA    1    1    NA   NA 
2       NLD     2007 NA    NA    1    1    NA   NA 

Based on the solution by Cole I tried to do the following:

# cast Event 
Events_cast <- reshape2::dcast(Events, country + year ~ year, length, value.var="year")

# update by reference for join later
Panel[, `:=`(start = min(year), end = max(year)), by = panelID]

# dcast sorts the rhs alphabetically
cols <- sort(unique(Events[['year']]))

# non-equi update join
Panel[Events_cast,
      on = .(country,
             start <= year,
             end >= year),
      (cols) := mget(cols)]

#clean up data frame
setnafill(Panel, fill = 0L, cols = cols)
Panel[, `:=`(start = NULL, end = NULL)]

Panel

But at the # non-equi update join I get the error: Error in [.data.table (Panel, Events, on = .(country, : LHS of := appears to be column positions but are outside [1,ncol] range. New columns can only be added by name.


Solution

  • is trying to use the years to figure out which column you are selecting. The error is telling you 2006 and other years are not valid column numbers. The fix is easy:

    cols <- as.character(sort(unique(Events[['year']])))
    

    And here's everything together with a few other changes including:

    1. Using data.table::dcast instead of reshape2::dcast
    2. Adding start and end to the Events data.table and casting with those columns.
    # cast Event 
    # Events_cast <- reshape2::dcast(Events, country + year ~ year, length, value.var="year")
    Events[, `:=`(start = min(year), end = max(year)), by = country]
    Events_cast <- dcast(Events, country + start + end~ year, length)
    
    # update by reference for join later
    Panel[, `:=`(start = min(year), end = max(year)), by = panelID]
    
    # dcast sorts the rhs alphabetically
    cols <- as.character(sort(unique(Events[['year']])))
    
    # non-equi update join
    # Panel[Events_cast,
    #       on = .(country,
    #              start <= year,
    #              end >= year),
    #       (cols) := mget(cols)]
    
    Panel[Events_cast,
          on = .(country,
                 start <= start,
                 end >= end),
          (cols) := mget(cols)]
    
    #clean up data frame
    setnafill(Panel, fill = 0L, cols = cols)
    Panel[, `:=`(start = NULL, end = NULL)]
    
    Panel