Search code examples
rrankingranking-functions

Ranking variables with conditions


Say I have the following data frame:

df <- data.frame(store       = LETTERS[1:8],
                 sales       = c(  9, 128,  54,  66,  23, 132,  89,  70),
                 successRate = c(.80, .25, .54, .92, .85, .35, .54, .46))

I want to rank the stores according to successRate, with ties going to the store with more sales, so first I do this (just to make visualization easier):

df <- df[order(-df$successRate, -df$sales), ]

In order to actually create a ranking variable, I do the following:

df$rank <- ave(df$successRate, FUN = function(x) rank(-x, ties.method='first'))

So df looks like this:

  store sales successRate rank
4     D    66        0.92    1
5     E    23        0.85    2
1     A     9        0.80    3
7     G    89        0.54    4
3     C    54        0.54    5
8     H    70        0.46    6
6     F   132        0.35    7
2     B   128        0.25    8

The problem is I don't want small stores to be part of the ranking. Specifically, I want stores with less than 50 sales not to be ranked. So this is how I define df$rank instead:

df$rank <- ifelse(df$sales < 50, NA,
                  ave(df$successRate, FUN = function(x) rank(-x, ties.method='first')))

The problem is that even though this correctly removes stores E and A, it doesn't reassign the rankings they were occupying. df looks like this now:

  store sales successRate rank
4     D    66        0.92    1
5     E    23        0.85   NA
1     A     9        0.80   NA
7     G    89        0.54    4
3     C    54        0.54    5
8     H    70        0.46    6
6     F   132        0.35    7
2     B   128        0.25    8

I've experimented with conditions inside and outside ave(), but I can'r get R to do what I want! How can I get it to rank the stores like this?

  store sales successRate rank
4     D    66        0.92    1
5     E    23        0.85   NA
1     A     9        0.80   NA
7     G    89        0.54    2
3     C    54        0.54    3
8     H    70        0.46    4
6     F   132        0.35    5
2     B   128        0.25    6

Solution

  • Super easy to do with data.table:

    library(data.table)
    dt = data.table(df)
    
    # do the ordering you like (note, could also use setkey to do this faster)
    dt = dt[order(-successRate, -sales)]
    
    dt[sales >= 50, rank := .I]
    dt
    #   store sales successRate rank
    #1:     D    66        0.92    1
    #2:     E    23        0.85   NA
    #3:     A     9        0.80   NA
    #4:     G    89        0.54    2
    #5:     C    54        0.54    3
    #6:     H    70        0.46    4
    #7:     F   132        0.35    5
    #8:     B   128        0.25    6
    

    If you must do it in data.frame, then after your preferred order, run:

    df$rank <- NA
    df$rank[df$sales >= 50] <- seq_len(sum(df$sales >= 50))