Search code examples
rdata.tablesubsetr-factor

Remove rows based on factor-levels


I have a data.frame df in format "long".

df <- data.frame(site = rep(c("A","B","C"), 1, 7),
                 time = c(11,11,11,22,22,22,33),
                 value = ceiling(rnorm(7)*10))
df <- df[order(df$site), ]

df
  site time value
1    A   11    12
2    A   22   -24
3    A   33   -30
4    B   11     3
5    B   22    16
6    C   11     3
7    C   22     9

Question

How do I remove the rows where an unique element of df$time is not present for each of the levels of df$site ?

In this case I want to remove df[3,], because for df$time the timestamp 33 is only present for site A and not for site B and site C.

Desired output:

df.trimmed
  site time value
1    A   11    12
2    A   22   -24
4    B   11     3
5    B   22    16
6    C   11     3
7    C   22     9

The data.frame has easily 800k rows and 200k unique timestamps. I don't want to use loops but I don't know how to use vectorized functions like apply() or lapply() for this case.


Solution

  • Here's another possible solution using the data.table package:

    unTime <- unique(df$time)
    
    library(data.table)
    
    DT <- data.table(df, key = "site")
    
    (notInAll <- unique(DT[, list(ans = which(!unTime %in% time)), by = key(DT)]$ans))
    # [1] 3
    
    DT[time %in% unTime[-notInAll]]
    
    #      site time value
    # [1,]    A   11     3
    # [2,]    A   22    11
    # [3,]    B   11    -6
    # [4,]    B   22    -2
    # [5,]    C   11   -19
    # [6,]    C   22   -14
    

    EDIT from Matthew
    Nice. Or a slightly more direct way :

    DT = as.data.table(df)
    tt = DT[,length(unique(site)),by=time]
    tt
       time V1
    1:   11  3
    2:   22  3
    3:   33  1
    
    tt = tt[V1==max(V1)]      # See * below
    tt
       time V1
    1:   11  3
    2:   22  3
    
    DT[time %in% tt$time]
       site time value
    1:    A   11     7
    2:    A   22    -2
    3:    B   11     8
    4:    B   22   -10
    5:    C   11     3
    6:    C   22     1
    

    In case no time is present in all sites, when final result should be empty (as Ben pointed out in comments), the step marked * above could be :

    tt = tt[V1==length(unique(DT$site))]