The context of my data is that I have (partial) time series data on the rental values for properties in a neighbourhood. A ‘treatment’ is a bank closing in the neighbourhood, and I want to estimate the effect of the closure on rental values. To do this I am using the difference-in-differences approach advocated by Callaway and Sant’Anna (2021). Thanks to the help I received via this posting (Expand and then fill a dataframe), I was able to construct a data frame suitable for analysis with the R package did. The code below creates the data frame df5 for an example with four properties. I have added a flag, obs, to say whether a row is a true observation or an infill (not required by did).
library(tidyverse)
year <- c(2014, 2020, 2021)
price <- c(100, 110, 120)
df0 <- data.frame(cbind(id=0, year, price))
year <- c(2014, 2019, 2021)
price <- c(100, 110, 120)
df1 <- data.frame(cbind(id=1, year, price))
year <- c(2019, 2020, 2021)
price <- c(210, 220, 230)
df2 <- data.frame(cbind(id=2, year, price))
year <- c(2014, 2015, 2019)
price <- c(300, 310, 320)
df3 <- data.frame(cbind(id=3, year, price))
id <- c(rep(0,8), rep(1,8), rep(2,8), rep(3,8))
year <- c(rep(seq(2014,2021), 4))
price <- c(100, NA, NA, NA, NA, NA, 110, 120,
100, NA, NA, NA, NA, 110, NA, 120,
NA, NA, NA, NA, NA, 210, 220, 230,
300, 310, NA, NA, NA, 320, NA, NA)
df4 <- data.frame(id, year, price, obs = !is.na(price))
df5 <- df4 %>% group_by( id ) %>% fill( price, .direction = "downup" )
df5$gyear <- c(rep(0,8), rep(2016,8), rep(2017,8), rep(2020,8))
My problem now is that I need to filter these properties so that each one kept is either never treated or, if it is treated, has at least one true observation before the treatment year and at least one at or after it.
Take id=0. This property is never treated (gyear == 0) and needs to be kept.
Take id=1. This property is treated in 2016 (gyear == 2016); we have at least one observation before treatment and one at or after it, so it needs to be kept.
Take id=2. This property is treated in 2017 (gyear == 2017); however, we do not have a before-treatment observation, so it needs to be REMOVED.
Take id=3. This property is treated in 2020 (gyear == 2020); however, we do not have an observation at or after treatment, so it needs to be REMOVED.
Thanks.
Callaway, B. and Sant’Anna, P.H., 2021. Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), pp.200-230.
This solution takes about 10 minutes on a data frame with 700k properties. If that is too long, the for loop could be parallelised, since each property is processed independently.
# Ensure rows are in id and year order
df5 <- df5[order(df5$id, df5$year), ]
list1 <- split(df5, df5$id)
to_keep <- c()
for (i in seq_along(list1)) {
  temp.df <- list1[[i]]
  G <- temp.df$gyear[1]
  gid <- temp.df$id[1]
  # If never treated, keep
  if (G == 0) {
    to_keep <- c(to_keep, gid)
  } else {
    # Years of the first and last true (non-infilled) observations
    obs_years <- temp.df$year[temp.df$obs]
    first_obs <- head(obs_years, 1)
    last_obs <- tail(obs_years, 1)
    # At least one observation before and one at or after treatment?
    if (length(obs_years) > 0 && G > first_obs && G <= last_obs) {
      to_keep <- c(to_keep, gid)
    }
  }
}
# Keep only the wanted properties
df5 <- df5[df5$id %in% to_keep, ]
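The same rule can probably be written without an explicit loop. The following is only a sketch, assuming dplyr and tidyr are available and that gyear is constant within each id. It rebuilds the four-property example so that it runs on its own; the name df5_kept is introduced here purely for illustration. I have not timed it on the 700k-property data, so treat any speed benefit as an expectation rather than a measurement.

```r
library(dplyr)
library(tidyr)

# Rebuild the four-property example from the question
id    <- rep(0:3, each = 8)
year  <- rep(2014:2021, times = 4)
price <- c(100, NA, NA, NA, NA, NA, 110, 120,
           100, NA, NA, NA, NA, 110, NA, 120,
           NA, NA, NA, NA, NA, 210, 220, 230,
           300, 310, NA, NA, NA, 320, NA, NA)
df5 <- data.frame(id, year, price, obs = !is.na(price)) %>%
  group_by(id) %>%
  fill(price, .direction = "downup") %>%
  ungroup()
df5$gyear <- rep(c(0, 2016, 2017, 2020), each = 8)

# Keep a property if it is never treated (gyear == 0), or if it has
# a true observation before gyear AND a true observation at or after gyear.
# df5_kept is a hypothetical name, not part of the original code.
df5_kept <- df5 %>%
  group_by(id) %>%
  filter(
    first(gyear) == 0 |
      (any(obs & year < gyear) & any(obs & year >= gyear))
  ) %>%
  ungroup()

unique(df5_kept$id)
```

The condition is evaluated once per id and recycled to every row of that group, so whole properties are kept or dropped together. Checking for any true observation before gyear and any at or after it is equivalent to comparing gyear against the first and last observed years, as the loop does.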