Search code examples
rdata.tablecharacterlagdcast

R - filter observations that remain the same over multiple years


I have prepared an example data.table:

testTable <- data.table(years = rep(c(rep((2014),3),rep((2015),3), rep((2016),3)), 2), 
                        policy = c(rep("A", 9), rep("B",9)),
                        destination = rep(c("Paris", "London", "Berlin"), 6))

testTable[c(1,5,8), destination := c("Moskaw", "Milano", "Valencia")]

> testTable
    years policy destination
 1:  2014      A      Moskaw
 2:  2014      A      London
 3:  2014      A      Berlin
 4:  2015      A       Paris
 5:  2015      A      Milano
 6:  2015      A      Berlin
 7:  2016      A       Paris
 8:  2016      A    Valencia
 9:  2016      A      Berlin
10:  2014      B       Paris
11:  2014      B      London
12:  2014      B      Berlin
13:  2015      B       Paris
14:  2015      B      London
15:  2015      B      Berlin
16:  2016      B       Paris
17:  2016      B      London
18:  2016      B      Berlin

Here, i only want to keep observations that have the same destination over all years available in the data. In this example i have chosen policies with 3 years only, but the real data may as well have 2,3 and 4 years of history mixed in a single data.table.

The desired result is:

> testTable
    years policy destination
 3:  2014      A      Berlin
 6:  2015      A      Berlin
 9:  2016      A      Berlin
10:  2014      B       Paris
11:  2014      B      London
12:  2014      B      Berlin
13:  2015      B       Paris
14:  2015      B      London
15:  2015      B      Berlin
16:  2016      B       Paris
17:  2016      B      London
18:  2016      B      Berlin

Any ides?

I tried using dcast(), and then i wanted to filter those rows that have the same entries in all columns after policy, however i realised that dcast() automatically converts my character variable destination into numeric and aggregates my data using length:

Aggregate function missing, defaulting to 'length'

NOTE: my data will have couple of hundreds of observations.


Solution

  • We can filter those common destination which is present in each years for each policy.

    library(data.table)
    
    testTable[testTable[, destination %in% 
                          Reduce(intersect, split(destination, years)), policy]$V1]
    
    
    #    years policy destination
    # 1:  2014      A      Berlin
    # 2:  2015      A      Berlin
    # 3:  2016      A      Berlin
    # 4:  2014      B       Paris
    # 5:  2014      B      London
    # 6:  2014      B      Berlin
    # 7:  2015      B       Paris
    # 8:  2015      B      London
    # 9:  2015      B      Berlin
    #10:  2016      B       Paris
    #11:  2016      B      London
    #12:  2016      B      Berlin
    

    and in dplyr :

    library(dplyr)
    
    testTable %>%
      group_by(policy) %>%
      filter(destination %in% Reduce(intersect, split(destination, years)))