Search code examples
rdplyrhierarchical-datapanel-data

How to filter out one variable conditioned on the change in itself and another variable?


I am trying to clean my age variable from data entry discrepancies in a panel data that follow individuals over time. Many respondents have a jump in their age from one observation to another because they have missed a few waves and then came back as we can see for the persons below with ID 1 and 2. However, the person with ID 3 had a jump in age that is not equal to the year that s/he was out of the panel.

Could someone please guide me on how to filter out respondents from my data that have unreasonable change in their age that is not equal to the number of years they were out of the panel but to other reasons such as data entry issues?

id  year    age
1   2005    50
1   2006    51
1   2010    55
2   2002    38
2   2005    41
2   2006    42
3   2006    30
3   2009    38
3   2010    39

structure(list(id = structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3), format.stata = "%9.0g"), 
    year = structure(c(2005, 2006, 2010, 2002, 2005, 2006, 2006, 
    2009, 2010), format.stata = "%9.0g"), age = structure(c(50, 
    51, 55, 38, 41, 42, 30, 38, 39), format.stata = "%9.0g")), row.names = c(NA, 
-9L), class = c("tbl_df", "tbl", "data.frame"))

Solution

  • We can use diff

    library(dplyr)
    df %>%
        group_by(id) %>% 
        filter(!all(diff(year) == diff(age)))
    

    -output

    # A tibble: 3 x 3
    # Groups:   id [1]
    #     id  year   age
    #  <dbl> <dbl> <dbl>
    #1     3  2006    30
    #2     3  2009    38
    #3     3  2010    39