Search code examples
rdataframesqldf

Select the first row by group


From a dataframe like this

test <- data.frame('id'= rep(1:5,2), 'string'= LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

> test
    id string
 1   1      A
 2   1      F
 3   2      B
 4   2      G
 5   3      C
 6   3      H
 7   4      D
 8   4      I
 9   5      E
 10  5      J

I want to create a new one with the first row of each id / string pair. If sqldf accepted R code within it, the query could look like this:

res <- sqldf("select id, min(rownames(test)), string 
              from test 
              group by id, string")

> res
    id string
 1   1      A
 3   2      B
 5   3      C
 7   4      D
 9   5      E

Is there a solution short of creating a new column like

test$row <- rownames(test)

and running the same sqldf query with min(row)?


Solution

  • You can use duplicated to do this very quickly.

    test[!duplicated(test$id),]
    

    Benchmarks, for the speed freaks:

    ju <- function() test[!duplicated(test$id),]
    gs1 <- function() do.call(rbind, lapply(split(test, test$id), head, 1))
    gs2 <- function() do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
    jply <- function() ddply(test,.(id),function(x) head(x,1))
    jdt <- function() {
      testd <- as.data.table(test)
      setkey(testd,id)
      # Initial solution (slow)
      # testd[,lapply(.SD,function(x) head(x,1)),by = key(testd)]
      # Faster options :
      testd[!duplicated(id)]               # (1)
      # testd[, .SD[1L], by=key(testd)]    # (2)
      # testd[J(unique(id)),mult="first"]  # (3)
      # testd[ testd[,.I[1L],by=id] ]      # (4) needs v1.8.3. Allows 2nd, 3rd etc
    }
    
    library(plyr)
    library(data.table)
    library(rbenchmark)
    
    # sample data
    set.seed(21)
    test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
    test <- test[order(test$id), ]
    
    benchmark(ju(), gs1(), gs2(), jply(), jdt(),
        replications=5, order="relative")[,1:6]
    #     test replications elapsed relative user.self sys.self
    # 1   ju()            5    0.03    1.000      0.03     0.00
    # 5  jdt()            5    0.03    1.000      0.03     0.00
    # 3  gs2()            5    3.49  116.333      2.87     0.58
    # 2  gs1()            5    3.58  119.333      3.00     0.58
    # 4 jply()            5    3.69  123.000      3.11     0.51
    

    Let's try that again, but with just the contenders from the first heat and with more data and more replications.

    set.seed(21)
    test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
    test <- test[order(test$id), ]
    benchmark(ju(), jdt(), order="relative")[,1:6]
    #    test replications elapsed relative user.self sys.self
    # 1  ju()          100    5.48    1.000      4.44     1.00
    # 2 jdt()          100    6.92    1.263      5.70     1.15