
Subset a data frame based on column entry (or rank)


I have a data.frame as simple as this one:

id group idu  value
1  1     1_1  34
2  1     2_1  23
3  1     3_1  67
4  2     4_2  6
5  2     5_2  24
6  2     6_2  45
1  3     1_3  34
2  3     2_3  67
3  3     3_3  76

from which I want to retrieve a subset containing the first entry of each group; something like:

id group idu value
1  1     1_1 34
4  2     4_2 6
1  3     1_3 34

id is not unique, so the approach should not rely on it.

Can I achieve this without loops?

data <- data.frame(
  id = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L),
  group = rep(1:3, each = 3L),
  idu = factor(c("1_1", "2_1", "3_1", "4_2", "5_2", "6_2", "1_3", "2_3", "3_3")),
  value = c(34L, 23L, 67L, 6L, 24L, 45L, 34L, 67L, 76L)
)

Solution

  • Using Gavin's million-row data frame:

    DF3 <- data.frame(id = sample(1000, 1000000, replace = TRUE),
                      group = factor(rep(1:1000, each = 1000)),
                      value = runif(1000000))
    DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))
    

    I think the fastest way is to reorder the data frame and then use duplicated:

    system.time({
      DF4 <- DF3[order(DF3$group), ]
      out2 <- DF4[!duplicated(DF4$group), ]
    })
    # user  system elapsed 
    # 0.335   0.107   0.441
    

    This compares to 7 seconds for Gavin's fastest lapply + split method on my computer.

    When working with data frames, the fastest approach is usually to generate all the row indices first and then do a single subset.
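
As a minimal sketch of that "indices first, one subset" idea, here it is applied to the small data frame from the question. The names ord, first_idx and first_rows are introduced here for illustration only and are not part of the original answer. It assumes that "first entry" means the first row of each group in the original row order.

```r
# Data frame from the question
data <- data.frame(
  id = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L),
  group = rep(1:3, each = 3L),
  idu = factor(c("1_1", "2_1", "3_1", "4_2", "5_2", "6_2", "1_3", "2_3", "3_3")),
  value = c(34L, 23L, 67L, 6L, 24L, 45L, 34L, 67L, 76L)
)

# Row order that sorts by group; ties keep their original relative order
ord <- order(data$group)

# Within that order, keep only the first occurrence of each group,
# expressed as row positions in the original data frame
first_idx <- ord[!duplicated(data$group[ord])]

# Single subset using the precomputed indices
first_rows <- data[first_idx, ]
first_rows
```

For this data the selected rows are rows 1, 4 and 7 of data, which matches the desired output in the question. If the rows are already in the order you care about, data[!duplicated(data$group), ] on its own should select the same rows, since duplicated() flags every occurrence after the first regardless of sorting.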