I have joined two dfs each with their own unique ID variables: topic and index. There are 54 unique topic IDs and 54 unique index IDs for a total of 2916 obs in the df. Each obs features a numeric score in a variable called value. I would like to subset to a df of 54 obs that features the topic-index pairings with the highest value while avoiding the repetition of any topic or index. For example, in the sample below, index 349 is repeated in the first two rows for topic 33 and topic 2. I'd like to keep index 349 assigned to topic 33, but then topic 2 would be assigned to the index with the next highest value which is index 347 (row 4 in the sample). How do I accomplish this in code for the entire dataframe?
SAMPLE
df <- structure(list(topic = c(33L, 2L, 33L, 2L, 33L, 13L, 33L, 2L,
2L, 2L, 42L, 13L, 33L), index = c(349, 349, 363, 347, 342, 369,
321, 366, 321, 363, 344, 370, 366), value = c(0.210311631079167,
0.204938177956459, 0.201678820628508, 0.160801031631647, 0.160747075179686,
0.154814646522019, 0.154102617910918, 0.137730410377001, 0.126294470150952,
0.123695668664189, 0.110965846294849, 0.0999091218902647, 0.099824248465453
)), row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"
))
Desired output
output <- structure(list(topic = c(33L, 2L, 13L, 42L), index = c(349, 347,
369, 344), value = c(0.210311631079167, 0.160801031631647, 0.154814646522019,
0.110965846294849)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
This code is obviously insufficient (and my sample above doesn't have 54 topics):
df2 <- df %>% group_by(topic, index) %>% arrange(-value) %>% filter(top_n(54))
POSSIBLE SOLUTION:
A friend helped out with a possible function for this:
find_pairs <- function(df){
store_max <- NULL
search_data <- df
for(i in 1:length(unique(df$topic))) {
max_ind <- which.max(search_data$value)
new_best <- search_data[max_ind,]
store_max <- rbind(store_max, new_best)
search_data <- search_data %>%
filter(topic != new_best$topic & index != new_best$index)
}
store_max
}