Search code examples
rformattingpaneldata-cleaningrowdeleting

How to delete observations based on changes in specific variable?


I have a balanced panel data table containing thousands of firms with observations in two years each (1 & 2). For the data cleaning process I need to exclude observations where there is a shift between the years in a supposedly time-constant variable.

example <- matrix(c(1,1,2,2,3,3,4,4,1,2,1,2,1,2,1,2,1,1,0,0,0,1,1,0), ncol=3)
colnames(example) <- c('id', 'year', 'supposedly time-constant')
example.table <- data.table(example)
example.table

   id year supposedly time-constant
1:  1    1                        1
2:  1    2                        1
3:  2    1                        0
4:  2    2                        0
5:  3    1                        0
6:  3    2                        1
7:  4    1                        1
8:  4    2                        0

So, in the above table, firms 3 & 4 both show a change in the supposedly time-constant variable, so they would need to be removed. Firms 1 & 2 are what I wish to have. I need a code/function that allows me to clean my data.

I seem to have reached the limits of my R knowledge and hope I can find help here - thanks in advance!


Solution

  • We can use dplyr and select groups which have only one unique value

    library(dplyr)
    example.table %>%
       group_by(id) %>%
       filter(n_distinct(`supposedly time-constant`) == 1)
    
    
    #     id  year `supposedly time-constant`
    #  <dbl> <dbl>                      <dbl>
    #1     1     1                          1
    #2     1     2                          1
    #3     2     1                          0
    #4     2     2                          0
    

    Same logic in base R using ave would be

    example.table[with(example.table, ave(`supposedly time-constant`, id, 
                 FUN = function(x) length(unique(x))) == 1), ]