
Unexpected output of dplyr::top_n


This is the expected output of dplyr::top_n!

To select Top 2

> mtcars %>% dplyr::arrange(desc(mpg)) %>% dplyr::top_n(2, mpg)

                mpg cyl disp hp drat    wt  qsec vs am gear carb
Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
Fiat 128       32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1

To select Top 3

> mtcars %>% dplyr::arrange(desc(mpg)) %>% dplyr::top_n(3, mpg)
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

But why do I get this when I select the top 4?

> mtcars %>% dplyr::arrange(desc(mpg)) %>% dplyr::top_n(4, mpg)
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

I expected this

                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2
Fiat X1-9      27.3   4 79.0  66 4.08 1.935 18.90  1  1    4    1

Can anybody please explain what I am missing?


Solution

  • top_n is superseded and should not be used; use slice_max instead.

    That said, slice_max(mtcars, mpg, n = 4) will give the same result as top_n(mtcars, n = 4, wt = mpg). This is because, under the hood, both use dplyr::min_rank to calculate ranks. Apart from row order, slice_max(mtcars, mpg, n = 4) is equivalent to mtcars %>% filter(min_rank(desc(mpg)) <= 4).
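
    As a quick check of that equivalence, the sketch below runs both forms on mtcars and compares the mpg values they keep. The names via_slice and via_filter are only illustrative, and the comparison sorts the values first because the two calls need not return rows in the same order.

    ```r
    library(dplyr)

    # slice_max keeps ties by default (with_ties = TRUE).
    via_slice <- slice_max(mtcars, mpg, n = 4)

    # The explicit min_rank filter described above.
    via_filter <- mtcars %>% filter(min_rank(desc(mpg)) <= 4)

    # Compare the selected mpg values, ignoring row order.
    same_values <- isTRUE(all.equal(sort(via_slice$mpg), sort(via_filter$mpg)))
    print(nrow(via_slice))
    print(same_values)
    ```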

    min_rank handles ties like so (see ?min_rank):

    min_rank() gives every tie the same (smallest) value so that c(10, 20, 20, 30) gets ranks c(1, 2, 2, 4). It's the way that ranks are usually computed in sports and is equivalent to rank(ties.method = "min").

    In your case of n = 4, the call returns 4 rows, because that is what it should return. min_rank(desc(c(33.9, 32.4, 30.4, 30.4, 27.3))) returns 1 2 3 3 5, so the fifth observation has rank 5, which is not <= 4, and it is dropped.
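
    To make the tie handling concrete, here is a small sketch that applies min_rank to the five highest mpg values quoted in the question. The vector name top_mpg is only for illustration.

    ```r
    library(dplyr)

    # The five highest mpg values from the question, in descending order.
    top_mpg <- c(33.9, 32.4, 30.4, 30.4, 27.3)

    # Tied values share the smallest rank, and the next rank is skipped.
    ranks <- min_rank(desc(top_mpg))
    print(ranks)

    # Rank 4 is never assigned, so n = 3 and n = 4 keep the same values.
    kept_n3 <- top_mpg[ranks <= 3]
    kept_n4 <- top_mpg[ranks <= 4]
    print(identical(kept_n3, kept_n4))
    ```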


    How to get the wanted result? You can use dense_rank, which handles ties differently: it leaves no gaps between ranks, so the same five values get ranks 1 2 3 3 4.

    mtcars %>% filter(dense_rank(desc(mpg)) <= 4)
    
    #                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
    # Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
    # Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
    # Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
    # Fiat X1-9      27.3   4 79.0  66 4.08 1.935 18.90  1  1    4    1
    # Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2
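
    The output above keeps the original row order of mtcars. As a sketch, assuming a best-first order is wanted (as in the question), the same filter can be followed by arrange. The name top4_values is hypothetical.

    ```r
    library(dplyr)

    # Keep every row whose mpg is among the four highest distinct values,
    # then sort from highest to lowest mpg.
    top4_values <- mtcars %>%
      filter(dense_rank(desc(mpg)) <= 4) %>%
      arrange(desc(mpg))

    print(top4_values[, c("mpg", "cyl", "hp")])
    ```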