In R, say I have a data frame times with columns athlete (character), season (integer), distance (factor with levels 400, 800, 1500, 5000, 10000) and tm (numeric), and I want to identify the indices of the rows that hold the lowest tm for each unique combination of the other three variables.
I can do this with the following code, which sorts by the grouping columns and then by tm:
times1 <- times # make a copy of the data frame
times1$rownum <- 1:nrow(times1) # add a column of original row numbers
times1 <- times1[with(times1, order(athlete, season, distance, tm)), ] # sort by group, then by tm
whichmins <- times1$rownum[!duplicated(subset(times1, select = -c(tm, rownum)))] # keep the first (fastest) row of each group
But I was wondering whether there is a more concise way to do it using aggregate, dplyr or data.table. I tried using dplyr's group_by function with which.min, but I could not get it to work.
Thank you
With the tidyverse, a similar approach is to arrange by the columns, filter the distinct rows using the logical vector from duplicated, and pull the 'rownum':
library(dplyr)
times %>%
  mutate(rownum = row_number()) %>%
  arrange(athlete, season, distance, tm) %>%
  filter(!duplicated(select(., -c(tm, rownum)))) %>%
  pull(rownum)
Or, instead of duplicated, use distinct:
times %>%
  mutate(rownum = row_number()) %>%
  arrange(athlete, season, distance, tm) %>%
  distinct(across(-c(tm, rownum)), .keep_all = TRUE) %>%
  pull(rownum)
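To sanity-check any of these variants, a small made-up data frame helps. The sketch below is base R only; the values are invented for illustration, and ord and whichmins are just names chosen here. It also shows that the permutation returned by order() already holds the original row numbers, so a separate 'rownum' column is optional.

```r
# Hypothetical toy data; the values are invented purely for illustration.
times <- data.frame(
  athlete  = c("A", "A", "A", "B", "B", "B"),
  season   = c(2020L, 2020L, 2021L, 2020L, 2020L, 2020L),
  distance = factor(c(400, 400, 400, 800, 800, 400),
                    levels = c(400, 800, 1500, 5000, 10000)),
  tm       = c(50.2, 49.8, 51.0, 120.5, 119.9, 52.3)
)

# order() returns the original row numbers in sorted order
# (by group first, then by tm within each group).
ord <- with(times, order(athlete, season, distance, tm))

# Keep the first row of each group in that sorted order, i.e. the fastest time.
whichmins <- ord[!duplicated(times[ord, c("athlete", "season", "distance")])]
whichmins # 2 3 6 5
```

The result follows the sorted group order, so wrap it in sort() if ascending row numbers are preferred.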
If we want to use a group-by operation, then after grouping by 'athlete', 'season' and 'distance', slice the row where 'tm' is at its minimum and pull the 'rownum':
times %>%
  mutate(rownum = row_number()) %>%
  group_by(athlete, season, distance) %>%
  slice_min(tm, n = 1, with_ties = FALSE) %>% # with_ties = FALSE keeps one row per group
  pull(rownum)
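One thing to watch with slice_min is ties: by default it keeps every row tied for the minimum, whereas the sort-and-deduplicate approach keeps exactly one row per group. A minimal sketch of the difference, assuming dplyr 1.0.0 or later and a made-up data frame with a deliberate tie (the names grouped, ties_kept and one_per_group are chosen here for illustration):

```r
library(dplyr)

# Hypothetical toy data; rows 1 and 2 tie for the fastest time in their group.
times <- data.frame(
  athlete  = c("A", "A", "A", "B", "B"),
  season   = c(2020L, 2020L, 2020L, 2020L, 2020L),
  distance = factor(c(400, 400, 400, 800, 800),
                    levels = c(400, 800, 1500, 5000, 10000)),
  tm       = c(49.8, 49.8, 50.2, 120.5, 119.9)
)

grouped <- times %>%
  mutate(rownum = row_number()) %>%
  group_by(athlete, season, distance)

# Default (with_ties = TRUE): both tied rows of the first group are returned.
ties_kept <- grouped %>%
  slice_min(tm, n = 1) %>%
  pull(rownum)

# with_ties = FALSE: exactly one row per group, whichever tied row comes first.
one_per_group <- grouped %>%
  slice_min(tm, n = 1, with_ties = FALSE) %>%
  pull(rownum)
```

Here ties_kept has three entries for two groups, while one_per_group has one entry per group.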
Or with summarise and which.min:
times %>%
  mutate(rownum = row_number()) %>%
  group_by(athlete, season, distance) %>%
  summarise(rownum = rownum[which.min(tm)], .groups = "drop") %>%
  pull(rownum)
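Or using data.table:
library(data.table)
dt <- as.data.table(times)[, rownum := .I] # copy times and record the original row numbers before sorting
dt[order(athlete, season, distance, tm),
   rownum[!duplicated(.SD[, setdiff(names(.SD), c('tm', 'rownum')), with = FALSE])]]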
Or with unique:
# setDT() and setorder() modify times in place; rn holds the original row names as character strings
unique(setorder(setDT(times, keep.rownames = TRUE), athlete, season, distance, tm),
       by = c('athlete', 'season', 'distance'))[, rn]
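A grouped data.table variant that avoids sorting altogether is to index .I by which.min within each group. This is a sketch on made-up data rather than part of the approaches above; dt and whichmins are names chosen here, and V1 is the default name data.table gives to an unnamed j result.

```r
library(data.table)

# Hypothetical toy data; the values are invented purely for illustration.
dt <- data.table(
  athlete  = c("A", "A", "A", "B", "B", "B"),
  season   = c(2020L, 2020L, 2021L, 2020L, 2020L, 2020L),
  distance = factor(c(400, 400, 400, 800, 800, 400),
                    levels = c(400, 800, 1500, 5000, 10000)),
  tm       = c(50.2, 49.8, 51.0, 120.5, 119.9, 52.3)
)

# While grouping, .I holds each row's position in dt, so indexing it with
# which.min(tm) picks the row number of the fastest time in every group.
whichmins <- dt[, .I[which.min(tm)], by = .(athlete, season, distance)]$V1
```

Like the summarise version, which.min returns only the first minimum, so tied times yield a single row per group.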