I have a balanced panel data table containing thousands of firms with observations in two years each (1 & 2). For the data cleaning process I need to exclude observations where there is a shift between the years in a supposedly time-constant variable.
example <- matrix(c(1,1,2,2,3,3,4,4,1,2,1,2,1,2,1,2,1,1,0,0,0,1,1,0), ncol=3)
colnames(example) <- c('id', 'year', 'supposedly time-constant')
example.table <- data.table(example)
example.table
id year supposedly time-constant
1: 1 1 1
2: 1 2 1
3: 2 1 0
4: 2 2 0
5: 3 1 0
6: 3 2 1
7: 4 1 1
8: 4 2 0
So, in the above table, firms 3 & 4 both show a change in the supposedly time-constant variable, so they would need to be removed. Firms 1 & 2 are what I wish to have. I need a code/function that allows me to clean my data.
I seem to have reached the limits of my R knowledge and hope I can find help here - thanks in advance!
We can use dplyr
and select groups which have only one unique value
library(dplyr)
example.table %>%
group_by(id) %>%
filter(n_distinct(`supposedly time-constant`) == 1)
# id year `supposedly time-constant`
# <dbl> <dbl> <dbl>
#1 1 1 1
#2 1 2 1
#3 2 1 0
#4 2 2 0
Same logic in base R using ave
would be
example.table[with(example.table, ave(`supposedly time-constant`, id,
FUN = function(x) length(unique(x))) == 1), ]