Select the n most frequent values in a variable


I would like to find the most common values in a column of a data frame. I assume table would be the best way to do this? I then want to subset the data frame so that it only includes the rows with these top-n values.

An example of my data frame is as follows. Here I want to find e.g. the top 2 IDs.

ID    col
A     blue
A     purple
A     green
B     green
B     red
C     red
C     blue
C     yellow
C     orange

I therefore want to output the following:

Top 2 values of ID are:
A and C

I will then select the rows corresponding to ID A and C:

ID    col
A     blue
A     purple
A     green
C     red
C     blue
C     yellow
C     orange

Solution

  • We can count how often each value occurs using table, sort the counts in decreasing order, and take the first 2 (or 10) entries. Their names are the most frequent IDs, which we then use to subset the data frame.

    df[df$ID %in% names(sort(table(df$ID), decreasing = TRUE)[1:2]), ]
    
    #  ID    col
    #1  A   blue
    #2  A purple
    #3  A  green
    #6  C    red
    #7  C   blue
    #8  C yellow
    #9  C orange
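
For a version you can run end to end, here is a minimal sketch that rebuilds the example data frame from the question and splits the one-liner into steps. The names n, counts, top_ids and top_df are only illustrative, and head() is used on the names instead of [1:2] so that asking for more IDs than exist does not produce NA entries.

```r
# Rebuild the example data frame from the question
df <- data.frame(
  ID  = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  col = c("blue", "purple", "green", "green", "red",
          "red", "blue", "yellow", "orange"),
  stringsAsFactors = FALSE
)

n <- 2  # how many of the most frequent IDs to keep (illustrative name)

# Count each ID and sort the counts from highest to lowest
counts <- sort(table(df$ID), decreasing = TRUE)

# The names of the first n counts are the most frequent IDs
top_ids <- head(names(counts), n)

# Report the top-n IDs, listed alphabetically
cat("Top", n, "values of ID are:\n")
cat(paste(sort(top_ids), collapse = " and "), "\n", sep = "")

# Keep only the rows whose ID is among the top n
top_df <- df[df$ID %in% top_ids, ]
print(top_df)
```

Note that if several IDs are tied at the cutoff, which of them is kept depends on the order returned by sort(), so you may want to handle ties explicitly if that matters for your data.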