Search code examples
rgreatest-n-per-group

R: How to identify indices of minima of all groups in data frame


In R, say I have a data frame times of times with columns: athlete (character), season (integer), distance (factor, out of 400, 800, 1500, 5000, 10000) and tm (float) and want to identify the indices of rows that are the lowest value of time for each unique combination of the other four variables.

I can do this with the following code that sorts by grouping columns and then by tm:

times1 <- times # make copy of array
times1$rownum <- 1:nrow(times1) # add column of row numbers
times1 <- times1[with(times1, order(athlete, season, distance, tm)), ] # sort array
whichmins <- times1$rownum[!duplicated(subset(times1, select = -c(tm, rownum)))] # identify rows where grouping factors change

But I was wondering if there was a more concise way to do it using aggregate, dplyr or data tables. I tried using dplyr's group_by function with which.min but I could not get it to work.

Thank you


Solution

  • With tidyverse, similar approach would be to arrange by the columns, filter the distinct elements based on the logical vector from duplicated and pull the 'rownum'

    library(dplyr)
    times %>%
         mutate(rownum = row_number()) %>%
         arrange(athlete, season, distance, tm) %>%
         filter(!duplicated(select(., -c(tm, rownum))) %>%
         pull(rownum)   
    

    Or instead of duplicated, use the distinct

    times %>%
       mutate(rownum = row_number()) %>%
       arrange(athlete, season, distance, tm) %>%
       distinct(across(-c(tm, rownum)), .keep_all = TRUE) %>%
       pull(rownum)   
    

    If we want to use a group by operation, then after the grouping by 'athlete', 'season', 'distance', slice the row where the 'tm' is minimum and pull the 'rownum'

    times %>%
        mutate(rownum = row_number())
        group_by(athlete, season, distance) %>%
        slice_min(tm) %>%
        pull(rownum)
    

    Or with summarise

    times %>%
        mutate(rownum = row_number())
        group_by(athlete, season, distance) %>%
        summarise(rownum = rownum[which.min(tm)]) %>%
        pull(rownum)
    

    Or using data.table

    library(data.table)
    setDT(times)[order(athlete, season, distance, tm), 
       .I[!duplicated(.SD[, setdiff(names(.SD), 'tm')), with = FALSE])]]
    

    Or with unique

    unique(setorder(setorder(setDT(times, keep.rownames = TRUE),
          athlete, season, distance, tm), by = c('athlete', 'season', 'distance'))[, rn]