Search code examples
rsortingdataframesequenceranking

Ranking subsets of a data frame in R for large datasets


I have the following dataset and I need to trace the sequence of locations where each user was for each day of the year.

   User    Date        Location     Time
    90   2013-01-28       39      16:06:20
    26   2013-02-04       27      19:32:09
    23   2013-02-04        5      16:03:39
    23   2013-01-07       29      15:40:25
    84   2013-02-27       50      17:25:40
    57   2013-01-30        5      17:26:26

I modified the script used in the following thread: Ranking subsets of a data frame in R

The modified code is the following:

data$User <- as.factor(data$User)
data$Date <- as.factor(data$Date)

data$Sequence <- ave(data$Time, data$User, data$Date, FUN=rank) 

data <- data[order(data$Sequence),]
data <- data[order(data$User),]
data <- data[order(data$Date),]

And the result:

   User    Date        Location     Time   Sequence
    3    2013-01-01       29      18:47:31    1
    4    2013-01-01       18      07:00:21    1
    4    2013-01-01       37      07:16:19    2
    4    2013-01-01       11      08:28:37    3
    6    2013-01-01        6      07:17:05    1
    6    2013-01-01       34      08:10:38    2

However, while it works for small dataframes, it takes a inordinate amount of time to run on the real dataset (5M rows with almost 100K individual users).

Is there a more efficient way to do this?


Solution

  • For larger data.frames, my experience is that ave can get pretty slow.

    Your biggest speed up will probably be with switching to data.table:

    # load data.table package
    library(data.table)
    # convert data.frame into data.table
    setDT(data)
    
    # get ranks and sort
    data[, Sequence := rank(Time), by=.(User, Date)][order(Sequence, User, Date),]
    

    This package is optimized for speed with large data.frames. Also, as you can see, it allows you to combine processes into one line, which can be pretty handy.