Search code examples
rdataframesubsetradix

A faster conditional subset


I'm trying to modify my dataframe according to the value in one column and having the most common value in another column.

df <- data.frame(points=c(1, 2, 4, 3, 4, 8, 3, 3, 2),
                 assists=c(6, 6, 5, 6, 6, 9, 9, 1, 1),
                 team=c('A', 'A', 'A', 'A', 'A', 'C', 'C', 'C', 'C'))

  points assists team
1      1       6    A
2      2       6    A
3      4       5    A
4      3       6    A
5      4       6    A
6      8       9    C
7      3       9    C
8      3       1    C
9      2       1    C

to look like this:

df2 <- data.frame(points=c(1, 2, 3, 4, 8, 3),
                 assists=c(6, 6, 6, 6, 1, 1),
                 team=c('A', 'A', 'A', 'A', 'C', 'C'))

  points assists team
1      1       6    A
2      2       6    A
3      3       6    A
4      4       6    A
5      8       1    C
6      3       1    C

The goal is to keep all rows that have the values A and C in the "team" column as long as in the "assists" column the most common value ("6" for "A" ) is kept. If there is a tie (such as "9" and "1" for "C") the last most common value should be kept.

I do this with a for loop but my dataframe has 3,000,000 rows and the process was very slow. Does anyone know a faster alternative?


Solution

  • We could modify the Mode function and do a group by approach to filter

    library(dplyr)
     Mode <- function(x) {
       # get the unique elements
       ux <- unique(x)
       # convert to integer index with match and get the frequency
       # tabulate should be faster than table
       freq <- tabulate(match(x, ux))
       # use == on the max of the freq, get the corresponding ux
       # then get the last elements of ux
       last(ux[freq == max(freq)])
     }
    df %>% 
       # grouped by team
       group_by(team) %>% 
       # filter only those assists that are returned from Mode function
       filter(assists %in% Mode(assists)) %>%
       ungroup
    

    -output

    # A tibble: 6 × 3
      points assists team 
       <dbl>   <dbl> <chr>
    1      1       6 A    
    2      2       6 A    
    3      3       6 A    
    4      4       6 A    
    5      3       1 C    
    6      2       1 C
    

    Or may use data.table methods for a faster execution

    library(data.table)
    # setDT - converts data.frame to data.table
    # create a frequency column (`.N`) by assignment (`:=`)
    # grouped by team, assists columns
    setDT(df)[, N := .N, by = .(team, assists)]
    # grouped by team, get the index of the max N from reverse (`.N:1`)
    #subset the assists with that index
    # create a logical vector with %in%
    # get the row index -.I, which creates a default column V1
    # extract the column ($V1) and use that to subset the data
    df[df[, .I[assists %in% assists[.N - which.max(N[.N:1]) + 1]],
         by = team]$V1][, N := NULL][]
       points assists   team
        <num>   <num> <char>
    1:      1       6      A
    2:      2       6      A
    3:      3       6      A
    4:      4       6      A
    5:      3       1      C
    6:      2       1      C